By Sasha Fridman
We aim to reveal key drivers of sales and revenues of our online store.
While the business has overall proven to be profitable, there is a need to identify products’ characteristics and sales patterns that contribute significantly to business growth, as well as those that may have a negative impact.
The dataset contains sales entries of an online store that sells household goods.
The file ecommerce_dataset_us.csv contains the following columns:
- InvoiceNo — order identifier
- StockCode — item identifier
- Description — item name
- Quantity — quantity of items
- InvoiceDate — order date
- UnitPrice — price per item
- CustomerID — customer identifier
Transaction-related terms
“Entry” (or “purchase”) - represents a single line in our dataset - one specific product being bought. While technically these are “entries” in our data, we often use the word “purchase” in more natural contexts. Each entry includes details like stock code, quantity, unit price, and invoice number.
“Invoice” (or “order”) - a group of entries representing a single transaction. An invoice can contain one or several entries (commonly, different products) purchased by the same customer at the same time.
In essence, each invoice represents a complete order, while entries show us purchases of individual products within that order. Technically (assuming no missing invoice numbers), counting unique invoice numbers (“nunique”) gives us the total number of orders, while counting all invoice entries (“count”) gives us the total number of individual product purchases.
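Note: as a quick illustration of this difference (df_ecom here is the DataFrame into which the dataset is loaded later in this notebook, with the columns listed above):
# total number of orders: each invoice counted once
n_orders = df_ecom['InvoiceNo'].nunique()

# total number of individual product purchases: every entry (line) counted
n_purchases = df_ecom['InvoiceNo'].count()

print(f'Orders: {n_orders}, individual product purchases: {n_purchases}')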
“Mutually exclusive entries” - these are pairs of entries where a customer makes and then returns the same purchase, with matching quantity, price, and stock code, but opposite signs for quantity and revenue. Some return scenarios (like partial returns or price differences) may not be captured by this definition. We have developed an approach for handling such cases, which will be explained and applied later in the Distribution Analysis section of the project.
“Returns” - are defined as negative quantity entries from mutually exclusive entries. The overall return volume might be slightly larger, as some returns could have been processed outside our defined return identification rules (for example, when a customer buys and returns the same product but at a different price or quantity).
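Note: the sketch below only illustrates the pairing idea and is not the exact rule applied later in the project; it flags entries that have a counterpart with the same customer, stock code, unit price and absolute quantity but the opposite quantity sign (partial returns, price differences and unbalanced groups are deliberately ignored here).
candidates = df_ecom.copy()
candidates['abs_qty'] = candidates['Quantity'].abs()
candidates['is_positive'] = candidates['Quantity'] > 0

# both quantity signs present within a key group -> potential purchase/return pair
pair_key = ['CustomerID', 'StockCode', 'UnitPrice', 'abs_qty']
signs_in_group = candidates.groupby(pair_key)['is_positive'].transform('nunique')
mutually_exclusive_mask = signs_in_group.eq(2)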
“Operation” (or “operational entry”) - an entry that represents non-product sales activity, like delivery, marketplace-related entries, service charges, or inventory adjustments (description examples: “POSTAGE”, “Amazon Adjustment”, “Bank Charges”, “damages”). We will analyze these cases and their impact, but exclude them from our product range analysis when they add noise without meaningful insights.
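Note: operational entries can be spotted, for example, by stock codes containing letters or by service-like keywords in the description; the keyword list below is an illustrative assumption, not the project’s final rule.
# stock codes with letters often correspond to operations rather than products
letter_codes = df_ecom['StockCode'].astype(str).str.contains('[A-Za-z]', regex=True)

# a few description keywords typical for operational entries (illustrative list)
operational_keywords = ['POSTAGE', 'Adjustment', 'Bank Charges', 'damages']
keyword_match = df_ecom['Description'].fillna('').str.contains('|'.join(operational_keywords), case=False)

operational_entries = df_ecom[letter_codes | keyword_match]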
General terms
“Sales volume” (or “purchases volume”) - we will use these terms to refer to quantity of units sold, not revenue generated from purchases.
“Wholesale purchases” - are defined as entries (individual product purchases) where the quantity falls within the top 5% of all entries.
“High-volume products” - are defined as products whose purchases volume (sum of quantities across all entries) falls within the top 5% of all products.
“High-volume customers” - are defined as customers whose purchases volume (sum of quantities across all entries) falls within the top 5% of all customers.
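Note: a minimal sketch of the top-5% definitions above (the inclusive “>=” cut-off is an assumption for illustration):
# wholesale purchases: entries whose quantity falls within the top 5% of all entries
qty_threshold = df_ecom['Quantity'].quantile(0.95)
wholesale_purchases = df_ecom[df_ecom['Quantity'] >= qty_threshold]

# high-volume products and customers: top 5% by summed quantities
product_volumes = df_ecom.groupby('StockCode')['Quantity'].sum()
high_volume_products = product_volumes[product_volumes >= product_volumes.quantile(0.95)].index

customer_volumes = df_ecom.groupby('CustomerID')['Quantity'].sum()
high_volume_customers = customer_volumes[customer_volumes >= customer_volumes.quantile(0.95)].index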
“Expensive products” - are defined as products whose *median unit price per entry falls within the top 5% of all products’ median unit prices.
“Cheap products” - are defined as products whose *median unit price per entry falls within the bottom 5% of all products’ median unit prices.
“New products” - are defined as products that experienced sales in the last three months of our dataset, but never before.
*Note: Here we use medians, since they represent typical values better than means for non-normal distributions, which our study shows to be the case for this data.
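Note: under the same illustrative assumptions as above, the price-based groups follow from per-product median unit prices:
# median unit price per product, then the top/bottom 5% cut-offs
median_prices = df_ecom.groupby('StockCode')['UnitPrice'].median()
expensive_products = median_prices[median_prices >= median_prices.quantile(0.95)].index
cheap_products = median_prices[median_prices <= median_prices.quantile(0.05)].index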
“IQR (Interquartile Range)” - the range between the first quartile (25th percentile) and third quartile (75th percentile) of the data. In our analysis, we will primarily use IQR for outliers detection.
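Note: for a numeric column (Quantity is taken here just as an example), the 1.5*IQR bounds used for outlier detection look like this:
q1 = df_ecom['Quantity'].quantile(0.25)
q3 = df_ecom['Quantity'].quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = df_ecom[(df_ecom['Quantity'] < lower_limit) | (df_ecom['Quantity'] > upper_limit)]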
💡 - An important insight relevant to this specific part of the study.
💡💡 - A key insight with significant implications for the entire project.
⚠ - Information requiring special attention (e.g., major clarifications or decision explanations), as it may impact further analysis.
Additional clarifications with more local relevance are preceded by the bold word “Note” and/or highlighted in italics.
!pip install sidetable -q
# data manipulation libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import sidetable
# date and time handling
from datetime import datetime, timedelta
import calendar
# visualization libraries
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import ScalarFormatter, EngFormatter
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# statistical and language processing libraries
import math
import re
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
# Matplotlib and Seaborn visualization configuration
plt.style.use('seaborn-v0_8')  # more attractive styling
plt.rcParams.update({'figure.figsize': (12, 7),
                     'grid.alpha': 0.5,
                     'grid.linestyle': '--',
                     'font.size': 10,
                     'axes.titlesize': 14,
                     'axes.labelsize': 10})
sns.set_theme(style="whitegrid", palette="deep")

# Pandas display options
pd.set_option('display.max_columns', None)
table_width = 150
pd.set_option('display.width', table_width)
col_width = 40
pd.set_option('display.max_colwidth', col_width)
#pd.set_option('display.precision', 2)
pd.set_option('display.float_format', '{:.2f}'.format)  # displaying normal numbers instead of scientific notation

# Python and Jupyter/IPython utility libraries and settings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'  # notebook enhanced output
from IPython.display import display, HTML, Markdown  # broader options for text formatting and displaying
import textwrap  # for formatting and wrapping text (e.g. to manage long strings in outputs)
# loading the data file into a DataFrame
try:
    df_ecom = pd.read_csv('C:/Users/4from/Desktop/Practicum/13. Final project/datasets/ecommerce_dataset_us.csv', sep='\t')
except:
    df_ecom = pd.read_csv('/datasets/ecommerce_dataset_us.csv', sep='\t')
Let’s enhance the efficiency of our further analysis by creating two functions: get_df_name and data_inspection.
Function: get_df_name
The get_df_name function retrieves and returns the name of a DataFrame variable as a string, which will be handy for displaying information explicitly in other functions.
def get_df_name(df):
"""
The function returns the user-defined name of the DataFrame variable as a string.
Input: the DataFrame whose name must be extracted.
Output: the name of the DataFrame.
"""
for name, value in globals().items():
if value is df:
if not name.startswith('_'): # excluding internal names
return name
return "name not found"
Function: data_inspection
The data_inspection function performs comprehensive inspections of a given DataFrame. It provides insights into the dataset’s structure, including concise summaries, examples, descriptive statistics, categorical parameter statistics, missing values, and duplicates.
def data_inspection(df, show_example=True, example_type='head', example_limit=5, frame_len=120):
    """
    The function performs various data inspections on a given DataFrame.
    As input it takes:
    - df: a DataFrame to be evaluated.
    - show_example (bool, optional): whether to display examples of the DataFrame. By default - True.
    - example_type (str, optional): type of examples to display ('sample', 'head', 'tail'). By default - 'head'.
    - example_limit (int, optional): maximum number of examples to display. By default - 5.
    - frame_len (int, optional): the length of the frame of printed outputs. Default - 120.
      If `show_example` is True, frame_len is set to the minimum of the manually set `frame_len`
      and `table_width` (which is defined at the project initiation stage).
    As output it presents:
    - Displays a concise summary.
    - Displays examples of the `df` DataFrame (if `show_example` is True).
    - Displays descriptive statistics.
    - Displays descriptive statistics for categorical parameters.
    - Displays information on missing values.
    - Displays information on duplicates.
    """
    # adjusting output frame; "table_width" is set at project initiation stage
    frame_len = min(table_width, frame_len) if show_example else frame_len

    # retrieving the name of the DataFrame
    df_name = get_df_name(df)

    # calculating figures on duplicates
    dupl_number = df.duplicated().sum()
    dupl_share = round(df.duplicated().mean()*100, 1)

    # displaying information about the DataFrame
    print('='*frame_len)
    display(Markdown(f'**Overview of `{df_name}`:**'))
    print('-'*frame_len)
    print(f'\033[1mConcise summary:\033[0m')
    print(df.info(), '\n')

    if show_example:
        print('-'*frame_len)
        example_messages = {'sample': 'Random examples', 'head': 'Top rows', 'tail': 'Bottom rows'}
        example_methods = {'sample': df.sample, 'head': df.head, 'tail': df.tail}
        message = example_messages.get(example_type)
        method = example_methods.get(example_type)
        print(f'\033[1m{message}:\033[0m')
        print(method(min(example_limit, len(df))), '\n')

    print('-'*frame_len)
    print(f'\033[1mDescriptive statistics:\033[0m')
    print(df.describe(), '\n')
    print('-'*frame_len)
    print(f'\033[1mDescriptive statistics of categorical parameters:\033[0m')
    print(df.describe(include=['object']), '\n')  # printing descriptive statistics for categorical parameters
    print('-'*frame_len)
    print(f'\033[1mMissing values:\033[0m')
    display(df.stb.missing(style=True))
    print('-'*frame_len)
    print(f'\033[1mNumber of duplicates\033[0m: {dupl_number} ({dupl_share :.1f}% of all entries)\n')
    print('='*frame_len)


data_inspection(df_ecom, show_example=True, example_type='sample', example_limit=5)
========================================================================================================================
Overview of df_ecom:
------------------------------------------------------------------------------------------------------------------------
Concise summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null object
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
dtypes: float64(2), int64(1), object(4)
memory usage: 28.9+ MB
None
------------------------------------------------------------------------------------------------------------------------
Random examples:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
189843 553167 22417 PACK OF 60 SPACEBOY CAKE CASES 1 05/11/2019 16:19 2.08 NaN
60555 541422 22342 HOME GARLAND PAINTED ZINC 3 01/15/2019 17:48 1.63 NaN
52189 540691 82583 HOT BATHS METAL SIGN 12 01/09/2019 08:50 2.21 17450.00
452762 575384 22910 PAPER CHAIN KIT VINTAGE CHRISTMAS 12 11/07/2019 15:17 2.95 17690.00
25836 538417 22791 T-LIGHT GLASS FLUTED ANTIQUE 10 12/10/2018 11:54 1.25 16393.00
------------------------------------------------------------------------------------------------------------------------
Descriptive statistics:
Quantity UnitPrice CustomerID
count 541909.00 541909.00 406829.00
mean 9.55 4.61 15287.69
std 218.08 96.76 1713.60
min -80995.00 -11062.06 12346.00
25% 1.00 1.25 13953.00
50% 3.00 2.08 15152.00
75% 10.00 4.13 16791.00
max 80995.00 38970.00 18287.00
------------------------------------------------------------------------------------------------------------------------
Descriptive statistics of categorical parameters:
InvoiceNo StockCode Description InvoiceDate
count 541909 541909 540455 541909
unique 25900 4070 4223 23260
top 573585 85123A WHITE HANGING HEART T-LIGHT HOLDER 10/29/2019 14:41
freq 1114 2313 2369 1114
------------------------------------------------------------------------------------------------------------------------
Missing values:
|             | missing | total   | percent |
|-------------|---------|---------|---------|
| CustomerID  | 135,080 | 541,909 | 24.93%  |
| Description | 1,454   | 541,909 | 0.27%   |
| InvoiceNo   | 0       | 541,909 | 0.00%   |
| StockCode   | 0       | 541,909 | 0.00%   |
| Quantity    | 0       | 541,909 | 0.00%   |
| InvoiceDate | 0       | 541,909 | 0.00%   |
| UnitPrice   | 0       | 541,909 | 0.00%   |
------------------------------------------------------------------------------------------------------------------------
Number of duplicates: 5268 (1.0% of all entries)
========================================================================================================================
# checking the dataset scope
columns = ['CustomerID', 'Description', 'StockCode', 'InvoiceNo']
first_invoice_day = pd.to_datetime(df_ecom['InvoiceDate']).min().date()
last_invoice_day = pd.to_datetime(df_ecom['InvoiceDate']).max().date()
total_period = (last_invoice_day - first_invoice_day).days

print('='*60)
display(Markdown(f'**The scope of `df_ecom`:**'))
print('-'*60)
print(f'\033[1mNumber of unique values:\033[0m')
for column in columns:
    print(f'    \033[1m`{column}`\033[0m - {df_ecom[column].nunique()}')
print('-'*60)
print(f'\033[1mEntries (purchases) per invoice:\033[0m\
mean - {df_ecom.groupby("InvoiceNo").size().mean() :0.1f},\
median - {df_ecom.groupby("InvoiceNo").size().median() :0.1f}')
print(f'\033[1mInvoices (orders) per customer:\033[0m\
mean - {df_ecom.groupby("CustomerID")["InvoiceNo"].nunique().mean() :0.1f},\
median - {df_ecom.groupby("CustomerID")["InvoiceNo"].nunique().median() :0.1f}')
print('-'*60)
print(f'\033[1mOverall period:\033[0m\
{first_invoice_day} - {last_invoice_day}, {total_period} days in total')
print('='*60)
============================================================
The scope of df_ecom:
------------------------------------------------------------
Number of unique values:
`CustomerID` - 4372
`Description` - 4223
`StockCode` - 4070
`InvoiceNo` - 25900
------------------------------------------------------------
Entries (purchases) per invoice: mean - 20.9, median - 10.0
Invoices (orders) per customer: mean - 5.1, median - 3.0
------------------------------------------------------------
Overall period: 2018-11-29 - 2019-12-07, 373 days in total
============================================================
Let’s examine the temporal consistency of invoices by checking that each invoice is associated with a single timestamp.
# checking whether all the invoices are associated with only one timestamp
invoices_dates = df_ecom.groupby('InvoiceNo').agg(
    unique_dates_number=('InvoiceDate', 'nunique'),
    unique_dates=('InvoiceDate', 'unique')
).reset_index().sort_values(by='unique_dates_number', ascending=False)

invoices_dates['unique_dates_number'].value_counts()

# filtering invoices with multiple timestamps
invoices_multiple_dates = invoices_dates.query('unique_dates_number > 1')
invoices_multiple_dates.sample(3)
unique_dates_number
1 25857
2 43
Name: count, dtype: int64
|       | InvoiceNo | unique_dates_number | unique_dates                         |
|-------|-----------|---------------------|--------------------------------------|
| 6684  | 550320    | 2                   | [04/15/2019 12:37, 04/15/2019 12:38] |
| 10527 | 558086    | 2                   | [06/24/2019 11:58, 06/24/2019 11:59] |
| 2372  | 541596    | 2                   | [01/17/2019 16:18, 01/17/2019 16:19] |
# adding a column displaying the time difference between timestamps (for rare cases with 2 timestamps; normally there's only 1)
invoices_multiple_dates = invoices_multiple_dates.copy()  # avoiding SettingWithCopyWarning
invoices_multiple_dates['days_delta'] = invoices_multiple_dates['unique_dates'].apply(
    lambda x: pd.to_datetime(x[1]) - pd.to_datetime(x[0]))

# checking the result
invoices_multiple_dates.sample(3)
invoices_multiple_dates['days_delta'].describe()
|      | InvoiceNo | unique_dates_number | unique_dates                         | days_delta      |
|------|-----------|---------------------|--------------------------------------|-----------------|
| 2475 | 541849    | 2                   | [01/21/2019 13:33, 01/21/2019 13:34] | 0 days 00:01:00 |
| 8154 | 553199    | 2                   | [05/13/2019 15:13, 05/13/2019 15:14] | 0 days 00:01:00 |
| 4642 | 546388    | 2                   | [03/09/2019 13:42, 03/09/2019 13:43] | 0 days 00:01:00 |
count 43
mean 0 days 00:01:00
std 0 days 00:00:00
min 0 days 00:01:00
25% 0 days 00:01:00
50% 0 days 00:01:00
75% 0 days 00:01:00
max 0 days 00:01:00
Name: days_delta, dtype: object
Observations
- InvoiceNo is of an object type. If possible, it should be converted to integer type.
- InvoiceDate is of an object type. It should be converted to datetime format.
- CustomerID is of a float type. It should be converted to string type (there’s no need for calculations with customer IDs, and keeping them in numeric format may affect further visualizations).
- There are negative values in the Quantity and UnitPrice columns. Further investigation is needed to understand and address these anomalies.
- The CustomerID column has 25% missing values and the Description column has 0.3% missing values.
- The number of unique item names (Description) slightly exceeds that of stock codes (StockCode). It could be an indication of multiple descriptions under the same stock codes, probably including non-product-related descriptions as well. We will check this phenomenon in our next steps.

Let’s enhance the efficiency of our further analysis by developing two practical functions: data_reduction and share_evaluation. Considering that we will view long names on compact charts in our subsequent study, an extra wrap_text function will be useful to ensure a neat appearance.
Function: data_reduction
The function simplifies the process of filtering data based on a specified operation. This operation can be any callable function or lambda function that reduces the DataFrame according to specific criteria. The function tells us how many entries were removed and returns the reduced DataFrame.
def data_reduction(df, operation):
    """
    The function reduces data based on the specified operation and reports the number of cleaned-out entries.
    As input it takes:
    - df (DataFrame): a DataFrame to be reduced.
    - operation: a lambda function that performs the reduction operation on the DataFrame.
    As output it presents:
    - Displays the number of cleaned-out entries.
    - Returns a reduced DataFrame.
    ----------------
    Example of usage (for excluding entries with negative quantities):
    "cleaned_df = data_reduction(initial_df, lambda df: df.query('quantity >= 0'))"
    ----------------
    """
    entries_before = len(df)

    try:
        reduced_df = operation(df)
    except Exception as error_message:
        print(f"\033[1;31mError during data reduction:\033[0m {error_message}")
        return df

    entries_after = len(reduced_df)
    cleaned_out_entries = entries_before - entries_after
    cleaned_out_share = (entries_before - entries_after) / entries_before * 100

    print(f'\033[1mNumber of entries cleaned out from the "{get_df_name(df)}":'
          f'\033[0m {cleaned_out_entries} ({cleaned_out_share:0.1f}%)')

    return reduced_df
Function: share_evaluation
The function evaluates the share and characteristics of a data subset compared to an initial dataset. It calculates and presents various metrics, such as the percentage of entries, the share of quantities and revenues (if applicable), and invoice period coverage. It also optionally displays examples of the data subset, as well as pie charts and boxplot visualizations of parameters’ shares and distributions. This function helps in understanding a subset’s impact within a broader dataset, which is especially useful when deciding whether to remove irrelevant data.
def share_evaluation(df, initial_df, title_extension='',
                     show_qty_rev=False,
                     show_pie_charts=False,
                     pie_chart_parameters={
                         ('quantity', 'sum'): 'Quantity Share',
                         ('revenue', 'sum'): 'Revenue Share',
                         ('invoice_no', 'count'): 'Entries Share'},
                     show_pie_charts_notes=True,
                     show_boxplots=False, boxplots_parameter=None, show_outliers=True,
                     show_period=False,
                     show_example=False, example_type='sample', random_state=None, example_limit=5,
                     frame_len=table_width):
    """
This function evaluates the share and characteristics of a data slice compared to an initial dataset.
It calculates and displays the following metrics for a given data slice:
- Percentage of entries relative to the initial dataset.
- Quantity and revenue totals together with their shares (if `show_qty_rev` is True).
    - Pie charts of the desired parameters (if 'show_pie_charts' is True).
- Boxplots of `quantity` and `revenue` (if 'show_boxplots' is True).
- Invoice period coverage (if 'show_period' is True).
- Examples of the data slice (if 'show_example' is True).
As input, the function takes:
- df (DataFrame): a data slice to be evaluated.
- initial_df (DataFrame): an original dataset for comparison. Default - `df_ecom`.
- title_extension (str, optional): additional text to append to the summary and plot titles. Default - an empty string.
- show_qty_rev (bool, optional): whether to display the quantity and revenue figures along with their shares. By default - False.
Note: both datasets must contain a 'revenue' column to display this.
..........
- show_pie_charts (bool, optional): whether to display pie charts. Default - False.
Note: `show_qty_rev` must be True to display this.
- pie_chart_parameters (dict, optional): a dictionary specifying parameters for pie chart creation.
Keys are tuples of (column_name, aggregation_function), and values are strings representing chart names.
Format: {(column_name, aggregation_function): 'Chart Name'}
Default: {('quantity', 'sum'): 'Quantity Share',
('revenue', 'sum'): 'Revenue Share',
('invoice_no', 'count'): 'Entries Share'}
- show_pie_charts_notes (bool, optional): whether to display predefined notes for certain pie charts. By default - True.
      Notes are available for: 'Quantity Share', 'Revenue Share', 'Entries Share', 'Invoices Coverage', 'Stock Codes Coverage',
'Descriptions Coverage', 'Products Coverage' and 'Customers Coverage'.
These notes explain the difference between count-based metrics and coverage-based metrics.
..........
- show_boxplots (bool, optional): whether to display boxplots for quantity and revenue distribution. By default, False.
Note: `show_qty_rev` must be True to display this.
- boxplots_parameter (str, optional): an additional categorical variable for the boxplot if needed.
If yes, the column of `df` must be specified. By default - None.
- show_outliers (bool, optional): whether to display outliers in boxplots. True shows them; False hides them. By default - True.
..........
- show_period (bool, optional): whether to display invoice period coverage. By default - False.
Note: both datasets must contain `invoice_day` and `invoice_month` columns to display this.
..........
- show_example (bool, optional): whether to display examples of the data slice. By default - False.
- example_type (str, optional): type of examples to display ('sample', 'head', 'tail'). By default - 'sample'.
    - random_state (int, optional): controls the randomness of sample selection. If provided, ensures consistent results across multiple runs. Default - None.
- example_limit (int, optional): maximum number of examples to display. By default - 5.
..........
    - frame_len (int, optional): length of the frame for printed outputs. Default - table_width. If `show_pie_charts` or `show_boxplots` is True, `frame_len` is set to `table_width` (which is defined at the project initiation stage). Otherwise, if `show_example` is True, the minimum of `table_width` and the manually set `frame_len` is used.
"""
    # adjusting output frame width
    if show_pie_charts or show_boxplots:
        frame_len = table_width
    elif show_example:
        frame_len = min(table_width, frame_len)
    elif show_period:
        frame_len = min(110, frame_len)

    # getting DataFrame names
    df_name = get_df_name(df) if get_df_name(df) != "name not found" else "the data slice mentioned in the function call"
    initial_df_name = get_df_name(initial_df) if get_df_name(initial_df) != "name not found" else "the initial DataFrame"

    # calculating basic statistics
    share_entries = round(len(df) / len(initial_df) * 100, 1)

    # adjusting the title extension if needed
    title_extension = f' {title_extension}' if title_extension else ''

    # printing the header
    print('='*frame_len)
    display(Markdown(f'**Evaluation of share: `{df_name}`{title_extension} in `{initial_df_name}`**\n'))
    print('-'*frame_len)
    print(f'\033[1mNumber of entries\033[0m: {len(df)} ({share_entries:.1f}% of all entries)\n')

    # handling quantity and revenue analysis
    if show_qty_rev and ('revenue' not in df.columns or 'quantity' not in initial_df.columns):
        print(f'\n\033[1;31mNote\033[0m: For displaying the data on revenues, all datasets must contain the "revenue" column.\n\n'
              f'To avoid this message, set: "show_qty_rev=False".')
        return

    # handling pie charts and boxplots
    if show_qty_rev:
        _display_quantity_revenue(df, initial_df)

        if show_pie_charts and pie_chart_parameters:
            _create_pie_charts(df, initial_df, df_name, initial_df_name,
                               pie_chart_parameters, show_pie_charts_notes, title_extension, frame_len)

        if show_boxplots:
            _create_boxplots(df, df_name, boxplots_parameter, show_outliers, title_extension, frame_len)

    # handling period coverage
    if show_period:
        _display_period_coverage(df, initial_df, frame_len)

    # handling examples
    if show_example:
        _display_examples(df, example_type, example_limit, random_state, frame_len)

    print('='*frame_len)
def _display_quantity_revenue(df, initial_df):
    """Helper function to display quantity and revenue statistics."""
    quantity = df['quantity'].sum()
    total_quantity = initial_df['quantity'].sum()
    quantity_share = abs(quantity / total_quantity) * 100
    revenue = round(df['revenue'].sum(), 1)
    total_revenue = initial_df['revenue'].sum()
    revenue_share = abs(revenue / total_revenue) * 100

    print(f'\033[1mQuantity\033[0m: {quantity} ({quantity_share:.1f}% of the total quantity)')
    print(f'\033[1mRevenue\033[0m: {revenue} ({revenue_share:.1f}% of the total revenue)')
def _create_pie_charts(df, initial_df, df_name, initial_df_name, pie_chart_parameters, show_pie_charts_notes, title_extension, frame_len):
"""Helper function to create and display pie charts."""
print('-'*frame_len)
# extracting metrics and names from parameters
    metrics_order = []
    pie_chart_names = []
    agg_dict = {}

    for (column, operation), chart_name in pie_chart_parameters.items():
        if column not in agg_dict:
            agg_dict[column] = []
        agg_dict[column].append(operation)
        metrics_order.append(f'{column}_{operation}')
        pie_chart_names.append(chart_name)

    total_metrics = initial_df.agg(agg_dict).abs()
    slice_metrics = df.agg(agg_dict).abs()

    # flattening metrics while preserving order
    total_metrics_flat = []
    slice_metrics_flat = []
    for column in agg_dict:
        for operation in agg_dict[column]:
            total_metrics_flat.append(total_metrics[column][operation])
            slice_metrics_flat.append(slice_metrics[column][operation])

    # checking values and creating pie charts
    values_check = True
    for metric_name, slice_val, total_val in zip(metrics_order, slice_metrics_flat, total_metrics_flat):
        if slice_val > total_val:
            print(f'\033[1;31mNote\033[0m: Unable to create pie chart as "{metric_name}" in the "{df_name}" ({slice_val:.0f}) exceeds the total "{metric_name}" ({total_val:.0f}) in the "{initial_df_name}".')
            values_check = False

    if values_check:
        percentages = [100 * slice_metric/total_metric for slice_metric, total_metric in zip(slice_metrics_flat, total_metrics_flat)]
        other_percentages = [100 - percent for percent in percentages]

        pie_charts_data = {name: [percent, 100-percent]
                           for name, percent in zip(pie_chart_names, percentages)}

        # plotting pie charts
        num_charts = len(pie_charts_data)
        rows = (num_charts + 1) // 2
        fig, axs = plt.subplots(rows, 2, figsize=(8, 4*rows))
        axs = axs.flatten() if isinstance(axs, np.ndarray) else [axs]

        pie_chart_name = f'Pie-charts' if len(pie_chart_names) > 1 else f'Pie-chart'
        fig.suptitle(f'The {pie_chart_name} of "{df_name}"{title_extension} vs Other Data in "{initial_df_name}"', fontsize=13, fontweight='bold', y=1)

        colors = sns.color_palette('pastel')

        for i, (metric, values) in enumerate(pie_charts_data.items()):
            ax = axs[i]
            wrapped_names = [wrap_text(name, 25) for name in [df_name, 'Other Data']]  # wrapping pie chart labels, if needed
            ax.pie(values, labels=wrapped_names, autopct='%1.1f%%', startangle=90, colors=colors)
            ax.set_title(f'{metric}', fontsize=12, y=1.02, fontweight='bold')

        # removing unused subplots
        for i in range(num_charts, len(axs)):
            fig.delaxes(axs[i])

        plt.tight_layout()
        plt.show()

    # displaying predefined notes for pie charts if needed
    if show_pie_charts_notes and pie_chart_parameters:
        notes_to_display = display_pie_charts_notes(pie_chart_parameters.values(), df_name, initial_df_name)
        notes_to_display_content = ''
        for note in notes_to_display.values():
            notes_to_display_content += note + '\n'

        # creating a collapsible section with notes
        notes_html = f'''
        <details>
        <summary style="color: navy; cursor: pointer;"><b><i>Click to view pie chart explanations</i></b></summary>
        <p>
        <ul>
        {notes_to_display_content}
        </ul>
        </p>
        </details>
        '''
        display(HTML(notes_html))
def _create_boxplots(df, df_name, boxplots_parameter, show_outliers, title_extension, frame_len):
    """Helper function to create and display boxplots."""
    print('-'*frame_len)

    palette = None
    if boxplots_parameter:
        palette = 'pastel'
        if boxplots_parameter not in df.columns:
            print(f'\033[1;31mNote\033[0m: boxplots_parameter "{boxplots_parameter}" is not applied, as it must be a column of the "{df_name}" DataFrame.\n'
                  f'To avoid this message, input a relevant column name or set: "boxplots_parameter=None".')
            boxplots_parameter, palette = None, None  # avoiding an error in the next step when building boxplots
        else:
            boxplots_parameter_limit = 10  # maximum number of boxes displayed within one graph
            boxplots_parameter_number = df[boxplots_parameter].nunique()  # the number of unique values of boxplots_parameter
            if boxplots_parameter_number > boxplots_parameter_limit:
                print(f'\033[1;31mNote\033[0m: `boxplots_parameter` "{boxplots_parameter}" is not applied, as the number of its unique values exceeds the threshold of {boxplots_parameter_limit}.\n'
                      f'To avoid this message, input another data slice or another `boxplots_parameter` with a number of values under the threshold level, or set: "boxplots_parameter=None".')
                boxplots_parameter, palette = None, None  # avoiding an error in the next step when building boxplots

    fig, axes = plt.subplots(1, 2, figsize=(13, 4))

    for i, metric in enumerate(['quantity', 'revenue']):
        sns.boxplot(data=df, x=boxplots_parameter, hue=boxplots_parameter, y=metric,
                    showfliers=show_outliers, ax=axes[i], palette=palette)

        # removing the legend if it exists
        legend = axes[i].get_legend()
        if legend is not None:
            legend.remove()

        title = f'The Boxplot of "{metric.title()}" in "{df_name}"{title_extension}'
        wrapped_title = wrap_text(title, 55)  # adjusting title width when necessary
        axes[i].set_title(wrapped_title, fontsize=13, fontweight='bold')
        axes[i].set_xlabel(boxplots_parameter, fontsize=12)
        axes[i].set_ylabel(metric.title(), fontsize=12)
        axes[i].tick_params(labelsize=10, rotation=90)
        axes[i].yaxis.set_major_formatter(EngFormatter())

    plt.subplots_adjust(wspace=0.3)
    plt.show()
def _display_period_coverage(df, initial_df, frame_len):
    """Helper function to display period coverage information."""
    print('-'*frame_len)

    required_columns = {'invoice_day', 'invoice_month'}

    if not (required_columns.issubset(df.columns) and required_columns.issubset(initial_df.columns)):
        print(f'\n\033[1;31mNote\033[0m: For displaying the invoice period coverage, all datasets must contain '
              f'the "invoice_day" and "invoice_month" columns.\n'
              f'To avoid this message, set: "show_period=False".')
        return

    first_invoice_day = df['invoice_day'].min()
    if pd.isnull(first_invoice_day):
        print('\033[1mInvoice period coverage:\033[0m does not exist')
        return

    # calculating periods
    last_invoice_day = df['invoice_day'].max()
    invoice_period = 1 if first_invoice_day == last_invoice_day else (last_invoice_day - first_invoice_day).days
    total_period = (initial_df['invoice_day'].max() - initial_df['invoice_day'].min()).days
    period_share = invoice_period / total_period * 100

    invoice_months_count = math.ceil(df['invoice_month'].nunique())
    total_period_months_count = math.ceil(initial_df['invoice_month'].nunique())

    print(f'\033[1mInvoice period coverage:\033[0m {first_invoice_day} - {last_invoice_day} '
          f'({period_share:.1f}%; {invoice_period} out of {total_period} total days; '
          f'{invoice_months_count} out of {total_period_months_count} total months)')
def _display_examples(df, example_type, example_limit, random_state, frame_len):
    """Helper function to display examples from the dataset."""
    print('-'*frame_len)

    example_methods = {
        'sample': lambda df: df.sample(n=min(example_limit, len(df)), random_state=random_state),
        'head': lambda df: df.head(min(example_limit, len(df))),
        'tail': lambda df: df.tail(min(example_limit, len(df)))}
    example_messages = {
        'sample': 'Random examples',
        'head': 'Top rows',
        'tail': 'Bottom rows'}

    message = example_messages.get(example_type)
    method = example_methods.get(example_type)

    print(f'\033[1m{message}:\033[0m\n')
    print(method(df))
def display_pie_charts_notes(pie_chart_names, df_name, initial_df_name):
    """Helper function to display notes for pie charts."""
    specific_notes = {
        'Quantity Share': (f'The <strong>"Quantity Share"</strong> pie chart represents the proportion of total item quantities, '
f'showing what percentage of all quantities in <code>{initial_df_name}</code> falls into <code>{df_name}</code>.'),
'Revenue Share': (f'The <strong>"Revenue Share"</strong> pie chart represents the proportion of total revenue, '
f'showing what percentage of all revenue in <code>{initial_df_name}</code> is generated in <code>{df_name}</code>.'),
'Entries Share': (f'The <strong>"Entries Share"</strong> pie chart represents the share of total entries (purchases), '
f'showing what percentage of all individual product purchases in <code>{initial_df_name}</code> occurs in <code>{df_name}</code>. '
f'Every entry is counted separately, even if they are associated with the same order.'),
'Invoices Coverage': (f'The <strong>"Invoices Coverage"</strong> pie chart shows the coverage of distinct invoices (orders). '
f'This metric may show a larger share than count-based metrics because it represents order range coverage '
f'rather than purchases volume. For example, if an order appears in 100 entries in total but only 1 entry '
f'falls into <code>{df_name}</code>, it still counts as one full unique order in this chart.'),
'Stock Codes Coverage': (f'The <strong>"Stock Codes Coverage"</strong> pie chart shows the coverage of distinct stock codes. '
f'This metric may show a larger share than count-based metrics because it represents stock code range coverage '
f'rather than purchases volume. For example, if a stock code appears in 100 entries in total but only 1 entry '
f'falls into <code>{df_name}</code>, it still counts as one full unique stock code in this chart.'),
'Descriptions Coverage': (f'The <strong>"Descriptions Coverage"</strong> pie chart shows the coverage of distinct product descriptions. '
f'This metric may show a larger share than count-based metrics because it represents description range coverage '
f'rather than purchases volume. For example, if a description appears in 100 entries in total but only 1 entry '
f'falls into <code>{df_name}</code>, it still counts as one full unique description in this chart.'),
'Products Coverage': (f'The <strong>"Products Coverage"</strong> pie chart shows the coverage of distinct products. '
f'This metric may show a larger share than count-based metrics because it represents product range coverage '
f'rather than purchases volume. For example, if a product appears in 100 entries in total but only 1 entry '
f'falls into <code>{df_name}</code>, it still counts as one full unique product in this chart.'),
'Customers Coverage': (f'The <strong>"Customers Coverage"</strong> pie chart shows the coverage of distinct customer IDs. '
f'This metric may show a larger share than count-based metrics because it represents customer reach '
f'rather than purchases volume. For example, if a customer made 50 purchases but only 1 purchase falls into '
f'<code>{df_name}</code>, they still count as one full unique customer in this chart.')}
    # getting only the notes for charts that were actually displayed
    notes_to_display = {}
    for name in pie_chart_names:
        if name in specific_notes:
            notes_to_display[name] = f'<li><i>{specific_notes[name]}</i></li>'  # creating a dynamically formatted HTML list of notes

    return notes_to_display
Function: wrap_text
The function wraps text into multiple lines, ensuring each line is within the specified width, while leaving shorter text unchanged. It distinguishes between text in “snake_case” format and ordinary text with words separated by spaces, treating each format appropriately.
def wrap_text(text, max_width=25):
    """
    Wraps a given text into multiple lines, ensuring that each line doesn't exceed `max_width`.
    If the text follows the "snake_case" format, it is wrapped at underscores.
    Otherwise it is wrapped at spaces between words (useful e.g. for notes that must be limited in string length).
    Input:
    - text (str): a text to be wrapped.
    - max_width (int): maximum line width. Default - 25.
    Output:
    - The wrapped text (str)
    """
    # handling text in "snake_case" format (e.g. labels for charts)
    if _is_snake_case(text):
        if len(text) <= max_width:
            return text

        parts = text.split('_')
        wrapped = []
        current_line = ''
        for part in parts:
            if len(current_line) + len(part) <= max_width:
                current_line = f'{current_line}_{part}' if current_line else part
            else:
                wrapped.append(current_line)
                current_line = f'_{part}'
        if current_line:  # appending the last line
            wrapped.append(current_line)
        return '\n'.join(wrapped)

    # handling text separated by spaces (e.g. for notes that must be limited in string length)
    else:
        return '\n'.join(textwrap.wrap(text, width=max_width))


def _is_snake_case(text):
    pattern = r'^[a-z0-9]+(_[a-z0-9]+)*$'
    return bool(re.match(pattern, text))
# checking the `InvoiceNo` column - whether it contains only integers
try:
    df_ecom['InvoiceNo'] = df_ecom['InvoiceNo'].astype(int)
    contains_only_integers = True
except ValueError:
    contains_only_integers = False

print(f'\033[1mThe `InvoiceNo` column contains integers only:\033[0m {contains_only_integers}')
The `InvoiceNo` column contains integers only: False
Observations and Decisions
- The InvoiceNo and CustomerID columns contain not only integers, so for now we will leave their original data types as they are.
- We will convert the CustomerID data type from float to string after addressing the missing values in this column.
- At this stage, we will convert the data type of the InvoiceDate column only.

Implementation of Decisions

df_ecom['InvoiceDate'] = pd.to_datetime(df_ecom['InvoiceDate'])
# converting camelCase to snake_case format (which in my opinion looks more lucid)
def camel_to_snake(name):
    c_to_s = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name)
    return c_to_s.lower()

df_ecom.columns = [camel_to_snake(column) for column in df_ecom.columns]
df_ecom.columns
Index(['invoice_no', 'stock_code', 'description', 'quantity', 'invoice_date', 'unit_price', 'customer_id'], dtype='object')
# investigating negative values in the `quantity` column
negative_qty_df = df_ecom[df_ecom['quantity'] < 0].copy()

share_evaluation(negative_qty_df, initial_df=df_ecom, show_qty_rev=False, show_boxplots=True, show_period=False,
                 show_example=True, example_type='sample', example_limit=3)
======================================================================================================================================================
Evaluation of share: negative_qty_df in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 10624 (2.0% of all entries)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id
455405 575613 23118 check -6 2019-11-08 12:47:00 0.00 NaN
170544 C551329 21714 CITRONELLA CANDLE GARDEN POT -2 2019-04-25 16:13:00 1.25 14626.00
155864 C550024 22456 NATURAL SLATE CHALKBOARD LARGE -3 2019-04-12 11:19:00 4.95 13089.00
======================================================================================================================================================
# investigating negative values in the `unit_price` column
negative_unit_price_df = df_ecom[df_ecom['unit_price'] < 0]

share_evaluation(negative_unit_price_df, initial_df=df_ecom, show_qty_rev=False, show_period=False,
                 show_example=True, example_type='sample', example_limit=3)
======================================================================================================================================================
Evaluation of share: negative_unit_price_df in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 2 (0.0% of all entries)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id
299983 A563186 B Adjust bad debt 1 2019-08-10 14:51:00 -11062.06 NaN
299984 A563187 B Adjust bad debt 1 2019-08-10 14:52:00 -11062.06 NaN
======================================================================================================================================================
Observations and Decisions
Implementation of Decisions
# getting rid of negative unit prices
df_ecom = data_reduction(df_ecom, lambda df: df.query('unit_price >= 0'))
Number of entries cleaned out from the "df_ecom": 2 (0.0%)
# investigating missing values in the `customer_id` column
missing_customer_id = df_ecom[df_ecom['customer_id'].isna()]

share_evaluation(missing_customer_id, initial_df=df_ecom, show_qty_rev=False, show_period=False,
                 show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: missing_customer_id in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 135078 (24.9% of all entries)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id
30751 538880 22303 COFFEE MUG APPLES DESIGN 4 2018-12-12 15:52:00 5.06 NaN
68516 541869 85132A CHARLIE + LOLA BISCUITS TINS 1 2019-01-22 09:35:00 8.29 NaN
435808 574076 23340 VINTAGE CHRISTMAS CAKE FRILL 1 2019-10-31 15:38:00 3.29 NaN
352490 567673 21980 PACK OF 12 RED RETROSPOT TISSUES 1 2019-09-19 15:43:00 0.83 NaN
478386 577078 22600 CHRISTMAS RETROSPOT STAR WOOD 4 2019-11-15 15:17:00 1.63 NaN
======================================================================================================================================================
# investigating missing values in the `description` column
missing_descriptions = df_ecom[df_ecom['description'].isna()]

share_evaluation(missing_descriptions, initial_df=df_ecom, show_qty_rev=False, show_period=False,
                 show_example=True, example_type='sample', random_state=7, example_limit=5)

missing_descriptions_qty = missing_descriptions['quantity'].sum()
missing_descriptions_qty_share = abs(missing_descriptions_qty / df_ecom['quantity'].sum())

print(f'\033[1mQuantity in the entries with missing descriptions:\033[0m {missing_descriptions_qty} ({missing_descriptions_qty_share *100 :0.1f}% of the total quantity).\n')
======================================================================================================================================================
Evaluation of share: missing_descriptions in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 1454 (0.3% of all entries)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id
74287 542417 84966B NaN -11 2019-01-25 17:38:00 0.00 NaN
250532 559037 82583 NaN 10 2019-07-03 15:29:00 0.00 NaN
171180 551394 16015 NaN 400 2019-04-26 12:37:00 0.00 NaN
468448 576473 21868 NaN -108 2019-11-13 11:40:00 0.00 NaN
201752 554316 21195 NaN -1 2019-05-21 15:29:00 0.00 NaN
======================================================================================================================================================
Quantity in the entries with missing descriptions: -13609 (0.3% of the total quantity).
Observations
- The customer_id column consists of ~25% missing values; this might reflect guest checkouts or unregistered users.
- The description column has 0.3% missing values, which account for 0.3% of the total quantity. According to sample entries, these missing values might be associated with data corrections, as the unit price is zero and many entries have a negative quantity.

Decisions
- customer_id is not crucial for our study, and considering that a substantial portion of the data (~1/4) is affected by missing values in this column, we won’t discard these records. Instead, we will convert the missing values in the customer_id column to zeros to ensure proper data processing. As decided above, we will convert the float data type to string.
- We will discard the entries with missing descriptions.

Implementation of Decisions
# converting the missing values to zeros in the `customer_id` column
df_ecom = df_ecom.copy()  # avoiding SettingWithCopyWarning
df_ecom['customer_id'] = df_ecom['customer_id'].fillna(0)

# converting the `customer_id` column to string type (first we convert the float to an integer, dropping any decimal places in naming)
df_ecom['customer_id'] = df_ecom['customer_id'].astype(int).astype(str)

# discarding records with missing descriptions
df_ecom = data_reduction(df_ecom, lambda df: df.dropna(subset=['description']))
Number of entries cleaned out from the "df_ecom": 1454 (0.3%)
As expected, after converting the missing values to zeros in the customer_id
column, the float type was successfully converted to integer.
# checking duplicates
duplicates = df_ecom[df_ecom.duplicated()]

share_evaluation(duplicates, initial_df=df_ecom, show_qty_rev=False, show_period=False,
                 show_example=True, example_type='head', example_limit=5)
======================================================================================================================================================
Evaluation of share: duplicates in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 5268 (1.0% of all entries)
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code description quantity invoice_date unit_price customer_id
517 536409 21866 UNION JACK FLAG LUGGAGE TAG 1 2018-11-29 11:45:00 1.25 17908
527 536409 22866 HAND WARMER SCOTTY DOG DESIGN 1 2018-11-29 11:45:00 2.10 17908
537 536409 22900 SET 2 TEA TOWELS I LOVE LONDON 1 2018-11-29 11:45:00 2.95 17908
539 536409 22111 SCOTTIE DOG HOT WATER BOTTLE 1 2018-11-29 11:45:00 4.95 17908
555 536412 22327 ROUND SNACK BOXES SET OF 4 SKULLS 1 2018-11-29 11:49:00 2.95 17920
======================================================================================================================================================
# getting rid of duplicates
df_ecom = data_reduction(df_ecom, lambda df: df.drop_duplicates())
Number of entries cleaned out from the "df_ecom": 5268 (1.0%)
# adding extra period-related columns
df_ecom['invoice_year'] = df_ecom['invoice_date'].dt.year
df_ecom['invoice_month'] = df_ecom['invoice_date'].dt.month
df_ecom['invoice_year_month'] = df_ecom['invoice_date'].dt.strftime('%Y-%m')
df_ecom['invoice_week'] = df_ecom['invoice_date'].dt.isocalendar().week
df_ecom['invoice_year_week'] = df_ecom['invoice_date'].dt.strftime('%G-Week-%V')
df_ecom['invoice_day'] = df_ecom['invoice_date'].dt.date
df_ecom['invoice_day_of_week'] = df_ecom['invoice_date'].dt.weekday
df_ecom['invoice_day_name'] = df_ecom['invoice_date'].dt.day_name()

df_ecom['revenue'] = df_ecom['unit_price'] * df_ecom['quantity']

# checking the result
df_ecom.sample(3)
| | invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
198304 | 554030 | 22027 | TEA PARTY BIRTHDAY CARD | 12 | 2019-05-18 13:56:00 | 0.42 | 16497 | 2019 | 5 | 2019-05 | 20 | 2019-Week-20 | 2019-05-18 | 5 | Saturday | 5.04 |
322709 | 565233 | 84912B | GREEN ROSE WASHBAG | 1 | 2019-08-31 09:34:00 | 3.29 | 0 | 2019 | 8 | 2019-08 | 35 | 2019-Week-35 | 2019-08-31 | 5 | Saturday | 3.29 |
156356 | 550134 | 22087 | PAPER BUNTING WHITE LACE | 18 | 2019-04-12 13:50:00 | 2.95 | 16249 | 2019 | 4 | 2019-04 | 15 | 2019-Week-15 | 2019-04-12 | 4 | Friday | 53.10 |
We set two primary objectives for the EDA part of the project:
Let’s note here that the focused Product Range Analysis will be conducted in the next phase, utilizing the data cleaned at this EDA stage.
Given the complexity of our study, we will arrange the plan for each component of EDA, describing parameters and study methods.
Parameters to study
- Distribution analysis
- Top performers analysis

Methods of study
- The distribution_IQR function will be handy for studying parameter distributions.
- We will use the share_evaluation function for assessing the share and impact of data slices.
- We will use the plot_totals_distribution function for reviewing totals and top performers.

⚠ Note: although some parts of our distribution analysis (like mutually exclusive entries or high-volume customers) go beyond common distribution analysis, keeping them here is reasonable as they provide early insights meaningful for later stages.
Identifiers analysis
- Invoice Number (invoice_no)
- Item Identifier (stock_code) and Item Name (description)

We will analyze invoice_no and stock_code to detect operational or non-product entries. We will filter those containing letters (during the initial data inspection we detected that the invoice_no and stock_code columns contain not only integers).

⚠ Note: The identifiers analysis may be integrated into the distribution analysis, if we find that deeper investigation of identifiers is necessary at that stage.
Parameters to study
- Parameters’ totals and typical unit price by month
- Invoice parameters by month
- Parameters by day of the week
- Distribution of invoices by week
- Parameters change dynamics by month

Methods of study: we will use the boxplots and plot_totals_distribution functions for this purpose.

While the core of our project is focused on Product Range Analysis, studying additional parameters such as unique customers by month or the correlation between average invoice revenue and day of the week is not central to our primary goal. However, these extra analyses are not highly time-consuming and may reveal valuable insights that contribute to a more comprehensive understanding of sales patterns.
When making decisions about removing irrelevant data, we will ask ourselves several questions:
To conclude:
Since we need to study several parameters with a similar approach, it’s reasonable to create a universal but adjustable set of tools for this purpose. The main tool will be a function called distribution_IQR. It will take our study parameters as input and provide graphs and calculations for data visualization and “cleaning” purposes (see the function description below for details).
For defining the limits of outliers in this function we will use the “1.5*IQR approach” (the whiskers of the boxplot).
But we won’t apply it blindly: for instance, we will use the “percentile approach” as well when reasonable (since not all parameters can be treated the same way). A percentile_outliers function is built for this purpose.
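Note: the actual percentile_outliers function is defined later in the project; the sketch below only illustrates the idea behind the percentile approach, and the 1st/99th percentile cut-offs are an assumption rather than the project’s setting.
def percentile_outliers_sketch(series, lower_pct=0.01, upper_pct=0.99):
    """Returns a boolean mask marking values outside the given percentile range."""
    lower_limit = series.quantile(lower_pct)
    upper_limit = series.quantile(upper_pct)
    return (series < lower_limit) | (series > upper_limit)

# example: flag entries whose quantity lies outside the 1st-99th percentile range
quantity_outliers = df_ecom[percentile_outliers_sketch(df_ecom['quantity'])]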
An additional get_sample_size function will serve us for quicker plotting of large datasets, where full resolution is not necessary.
The plot_totals_distribution function is designed for quick calculation and visualization of distributions and/or totals for selected parameters, allowing the display of random, best, or worst performers.
Thanks to previous projects, two of these functions are already largely in place; only minor adjustments currently remain.
Function: get_sample_size
def get_sample_size(df, target_size=10000, min_sample_size=0.01, max_sample_size=1):
    """
    The function calculates the optimal fraction of data to reduce the DataFrame size.
    It is applied for quicker plotting of large datasets, where full resolution is not needed.
    As input this function takes:
    - df (DataFrame): the DataFrame to be reduced if needed.
    - target_size (int): desired sample size (default - 10000)
    - min_sample_size (float): minimum sampling fraction (default - 0.01, which means 1% of the df)
    - max_sample_size (float): maximum sampling fraction (default - 1, which means 100% of the df)
    Output:
    - float: sampling fraction between min and max, or 1 if df is smaller than target_size
    ----------------
    Note: A target_size in the thousands typically provides a sufficient representation of the overall data distribution for most plotting purposes.
    However, accuracy may vary based on data complexity. A higher target_size results in slower graph plotting, but more reliable outcomes.
    ----------------
    """
    current_size = len(df)
    if current_size <= target_size:
        return 1  # no sampling needed

    sample_size = target_size / current_size
    return max(min(sample_size, max_sample_size), min_sample_size)
Function: distribution_IQR
def distribution_IQR(df, parameter, x_limits=None, title_extension='', bins=[50, 100], outliers_info=True, speed_up_plotting=True, target_sample=10000, frame_len=50):
"""
The function analyzes the distribution of a specified DataFrame column using discriptive statistics, histograms and boxplots.
As input this function takes:
- df: the DataFrame containing the data to be analyzed.
- parameter (str): the column of the DataFrame to be analyzed.
- x_limits (list of float, optional): the x-axis limits for the histogram. If None, limits are set automatically. Default is None.
- title_extension (str, optional): additional text to append to the summary and plot titles. Default - empty string.
- bins (list of int, optional): list of bin numbers for histograms. Default - [50, 100].
- outliers_info (bool, optional): whether to display summary statistics and information on outliers. Default - True.
- speed_up_plotting (bool, optional): whether to speed up plotting by using a sample data slice of the DataFrame instead of the full DataFrame.
This option can significantly reduce plotting time for large datasets (tens of thousands of rows or more) when full resolution is not necessary.
Note that using a sample may slightly reduce the accuracy of the visualization, but is often sufficient for exploratory analysis. Default - True.
- target_sample (int, optional): the desired sample size when 'speed_up_plotting' is True. This parameter is passed to the get_sample_size function
to determine the appropriate sampling fraction. A larger 'target_sample' will result in a more accuracy of the visualization but slower plotting.
Default - 10000.
- frame_len (int, optional): the length of frame of printed outputs. Default - 50.
As output the function presents:
- Displays several histograms with set bin numbers.
- Displays two boxplots: the first with outliers included, and the second with outliers excluded.
- Provides main descriptive statistics for the specified parameter.
- Provides the upper and lower limits of outliers (if 'outliers_info' is set to True).
"""
    # retrieving the name of the data slice
    df_name = get_df_name(df) if get_df_name(df) != "name not found" else "the DataFrame"

    # adjusting the title extension
    if title_extension:
        title_extension = f' {title_extension}'

    # plotting histograms of the parameter distribution for each bin number
    if speed_up_plotting:
        frac = get_sample_size(df, target_size=target_sample)
        if frac != 1:
            df_sampled = df.sample(frac=frac, replace=False, random_state=7) # ensuring consistency across runs and preventing multiple sampling of the same row
            dataset_size = f'{frac*100:.0f}%'
            print(f'\n\033[1mNote\033[0m: A sample data slice {dataset_size} of "{df_name}" was used for histogram plotting instead of the full DataFrame.\n'
                  f'This significantly reduced plotting time for the large dataset. '
                  f'The accuracy of the visualization might be slightly reduced, '
                  f'meanwhile it should be sufficient for exploratory analysis.\n')
        else:
            df_sampled = df
            dataset_size = 'Full Dataset'
    else:
        dataset_size = 'Full Dataset'
        df_sampled = df

    if not isinstance(bins, list): # addressing the case of only one integer bins number (creating a list of 1 integer, for proper processing later in the code)
        try:
            bins = [int(bins)] # convert bins to int and create a list
        except:
            print("Bins is not a list or integer")

    if len(bins) == 2:
        fig, axes = plt.subplots(1, 2, figsize=(14, 3.5))
        for i in [0, 1]:
            sns.histplot(df_sampled[parameter], bins=bins[i], ax=axes[i])
            title = f'The Histogram of "{parameter}" in "{df_name}"{title_extension}, bins = {bins[i]}, sample size = {dataset_size}'
            wrapped_title = wrap_text(title, 55) # adjusting title width when it's necessary
            axes[i].set_title(wrapped_title, fontsize=13, fontweight='bold')
            axes[i].set_xlabel(parameter, fontsize=12)
            axes[i].set_ylabel('Frequency', fontsize=12)
            axes[i].tick_params(labelsize=10)

            # set manual xlim if it's provided
            if x_limits is not None:
                axes[i].set_xlim(x_limits)
        plt.tight_layout()
        plt.subplots_adjust(wspace=0.3, hspace=0.2)
        plt.show()
    else:
        for i in bins:
            plt.figure(figsize=(6, 3))
            sns.histplot(df_sampled[parameter], bins=i)
            title = f'The Histogram of "{parameter}" in "{df_name}"{title_extension}, bins={i}, sample size = {dataset_size}'
            wrapped_title = wrap_text(title, 55) # adjusting title width when it's necessary
            plt.title(wrapped_title, fontsize=13, fontweight='bold')
            plt.xlabel(parameter, fontsize=12)
            plt.ylabel('Frequency', fontsize=12)
            plt.tick_params(labelsize=10)

            # set manual xlim if it's provided
            if x_limits is not None:
                plt.xlim(x_limits)
            plt.show()
    print('\n')

    # plotting a boxplot of the parameter distribution
    fig, axes = plt.subplots(1, 2, figsize=(17.4, 1.5))
    for i in [0, 1]:
        sns.boxplot(x=df[parameter], showfliers=(True if i == 0 else False), ax=axes[i])
        title = f'The Boxplot of "{parameter}" in "{df_name}"{title_extension} {"With Outliers" if i == 0 else "Without Outliers"}, Full Dataset'
        wrapped_title = wrap_text(title, 55) # adjusting title width when it's necessary
        axes[i].set_title(wrapped_title, fontsize=13, fontweight='bold')
        axes[i].set_xlabel(parameter, fontsize=12)
        axes[i].tick_params(labelsize=10)

    plt.subplots_adjust(wspace=0.2, hspace=0.2)
    plt.show()
    print('\n')

    # calculating and displaying descriptive statistics of the parameter and a summary about its distribution skewness
    print('='*frame_len)
    display(Markdown(f'**Statistics on `{parameter}` in `{df_name}`{title_extension}**\n'))
    print(f'{df[parameter].describe()}')
    #print('Median:', round(df[parameter].median(),1)) # may be redundant, as describe() method already provides 50% value
    print('-'*frame_len)

    # defining skewness
    skewness = df[parameter].skew()
    abs_skewness = abs(skewness)

    if abs_skewness < 0.5:
        skewness_explanation = '\033[1;32mslightly skewed\033[0m' # green
    elif abs_skewness < 1:
        skewness_explanation = '\033[1;33mmoderately skewed\033[0m' # yellow
    elif abs_skewness < 5:
        skewness_explanation = '\033[1;31mhighly skewed\033[0m' # red
    else:
        skewness_explanation = '\033[1;31;2mextremely skewed\033[0m' # dark red

    direction = 'right' if skewness > 0 else 'left'
    print(f'The distribution is {skewness_explanation} to the {direction} \n(skewness: {skewness:.1f})')
    print(f'\n\033[1mNote\033[0m: outliers affect skewness calculation')

    # calculating and displaying descriptive statistics and information on outliers
    if outliers_info:
        Q1 = round(df[parameter].quantile(0.25))
        Q3 = round(df[parameter].quantile(0.75))
        IQR = Q3 - Q1
        min_iqr = Q1 - round(1.5 * IQR)
        max_iqr = Q3 + round(1.5 * IQR)

        print('-'*frame_len)
        print('Min border:', min_iqr)
        print('Max border:', max_iqr)
        print('-'*frame_len)

        total_count = len(df[parameter])
        outliers_count = len(df[(df[parameter] < min_iqr) | (df[parameter] > max_iqr)])
        outliers_over_max_iqr_count = len(df[df[parameter] > max_iqr])
        outlier_percentage = round(outliers_count / total_count * 100, 1)
        outlier_over_max_iqr_percentage = round(outliers_over_max_iqr_count / total_count * 100, 1)

        if min_iqr < 0:
            print(f'The outliers are considered to be values above {max_iqr}')
            print(f'We have {outliers_over_max_iqr_count} values that we can consider outliers')
            print(f'Which makes {outlier_over_max_iqr_percentage}% of the total "{parameter}" data')
        else:
            print(f'The outliers are considered to be values below {min_iqr} and above {max_iqr}')
            print(f'We have {outliers_count} values that we can consider outliers')
            print(f'Which makes {outlier_percentage}% of the total "{parameter}" data')
        print('='*frame_len)
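The "Min border" and "Max border" values printed by this function come from rounded quartiles and 1.5 × IQR. As a worked check, plugging in the quartiles reported later for quantity (Q1 = 1, Q3 = 10) reproduces the -13 and 24 borders shown in the output below:

```python
# quartiles taken from the describe() output for "quantity" shown below
q1, q3 = 1, 10
iqr = q3 - q1                        # 9
min_border = q1 - round(1.5 * iqr)   # 1 - 14 = -13 (round(13.5) -> 14)
max_border = q3 + round(1.5 * iqr)   # 10 + 14 = 24
print(min_border, max_border)        # -13 24
```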
Function: percentile_outliers
def percentile_outliers(df, parameter, title_extension='', lower_percentile=3, upper_percentile=97, frame_len=70, print_limits=False):
"""
The function identifies outliers in a DataFrame column using percentile limits.
As input this function takes:
- df: the DataFrame containing the data to be analyzed.
- parameter (str): the column of the DataFrame to be analyzed.
- title_extension (str, optional): additional text to append to the plot titles. Default - empty string.
- lower_percentile (int or float, optional): the lower percentile threshold. Default - 3.
- upper_percentile (int or float, optional): the upper percentile threshold. Default - 97.
- frame_len (int, optional): the length of frame of printed outputs. Default - 70.
- print_limits (bool, optional): whether to print the limits dictionary. Default - False.
As output the function presents:
- the upper and lower limits of outliers and their share of the initial DataFrame.
- the function creates a dictionary with the limit names and their values and updates the global namespace accordingly.
"""
    # adjusting output frame width
    if print_limits:
        frame_len = 110

    # adjusting the title extension
    if title_extension:
        title_extension = f' {title_extension}'

    # calculating the lower and upper percentile limits
    lower_limit = round(np.percentile(df[parameter], lower_percentile), 2)
    upper_limit = round(np.percentile(df[parameter], upper_percentile), 2)

    # identifying outliers
    outliers = df[(df[parameter] < lower_limit) | (df[parameter] > upper_limit)]
    outliers_count = len(outliers)
    total_count = len(df[parameter])
    outlier_percentage = round(outliers_count / total_count * 100, 1)

    # displaying data on outliers
    print('='*frame_len)
    display(Markdown(f'**Data on `{parameter}` outliers {title_extension} based on the "percentile approach"**\n'))
    print(f'The outliers are considered to be values below {lower_limit} and above {upper_limit}')
    print(f'We have {outliers_count} values that we can consider outliers')
    print(f'Which makes {outlier_percentage}% of the total "{parameter}" data')

    # retrieving the df name
    df_name = get_df_name(df) if get_df_name(df) != "name not found" else "df"

    # creating dynamic variable names
    lower_limit_name = f'{df_name}_{parameter}_lower_limit'
    upper_limit_name = f'{df_name}_{parameter}_upper_limit'

    # creating a limits dictionary
    limits = {lower_limit_name: lower_limit, upper_limit_name: upper_limit} # we can refer to them in further analyses, if needed

    # updating the global namespace with the limits
    globals().update(limits)

    # printing limits, if required
    if print_limits:
        print('-'*frame_len)
        print(f'Limits: {limits}')
    print('='*frame_len)
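The dynamic limit variables rely on `globals().update()` plus the `@` name resolution of pandas `query()`. A minimal standalone sketch of that pattern, with a hypothetical `df_demo` (not project data):

```python
import pandas as pd

df_demo = pd.DataFrame({'quantity': [1, 2, 3, 50, 200]})  # hypothetical data

# a simplified version of what the function does behind the scenes
limits = {'df_demo_quantity_upper_limit': float(df_demo['quantity'].quantile(0.97))}
globals().update(limits)

# the injected name is now visible to query() through the "@" prefix
print(df_demo.query('quantity > @df_demo_quantity_upper_limit'))
```

Injecting names into the global namespace is convenient in a notebook, though returning the limits dictionary and referencing it explicitly would be the more defensive design.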
Function: plot_totals_distribution
def plot_totals_distribution(df, parameter_column, value_column, n_items=20, sample_type='head', random_state=None,
                             show_outliers=False, fig_height=500, fig_width=1000, color_palette=None,
                             sort_ascending=False, title_start=True, title_extension='', plot_totals=True, plot_distribution=True, consistent_colors=False):
    """
This function calculates and displays the following:
- A horizontal bar chart of the specified items by total value (optional).
- Box plots showing the distribution of values for each specified item (optional).
As input the function takes:
- df (DataFrame): the data to be analyzed.
- parameter_column (str): name of the column containing the names of parameters (e.g., product names).
- value_column (str): name of the column containing the values to be analyzed (e.g., 'quantity').
- n_items (int, optional): number of items to display. Default - 20.
- sample_type (str, optional): type of sampling to use. Options are 'sample', 'head', or 'tail'. Default - 'head'.
- random_state (int, optional): controls the randomness of sample selection. Default - None.
- show_outliers (bool, optional): whether to display outliers in the box plots. Default - False.
- fig_height (int, optional): height of the figure in pixels. Default - 500.
- fig_width (int, optional): width of the figure in pixels. Default - 1000.
- color_palette (list, optional): list of colors to use for the plots.
If None, uses px.colors.qualitative.Pastel. Default - None.
- sort_ascending (bool, optional): if True, sorts the displayed parameters in ascending order based on the value column. Sorting is not applied in case of random sampling (when 'sample_type' = 'sample'). Default - False.
- title_start (bool, optional): whether to display information about sampling type in the beginning of a title. Default - True.
- title_extension (str, optional): additional text to append to the plot title. Default - empty string.
- plot_totals (bool, optional): if True, plots the totals bar chart. If False, only plots the distribution (if enabled). Default - True.
- plot_distribution (bool, optional): if True, plots the distribution alongside totals. If False, only plots totals. Default - True.
- consistent_colors (bool, optional): if True, uses the same colors for the same parameter values across different runs. Default - False.
As output the function presents:
- A plotly figure containing one or both visualizations side by side.
"""
    # handling error in case of wrong/lacking `parameter_column` or `value_column`
    if parameter_column not in df.columns or value_column not in df.columns:
        raise ValueError(f'Columns {parameter_column} and/or {value_column} not found in {get_df_name(df)}.')

    # defining sampling methods and messages
    sampling_methods = {
        'sample': lambda df: df.sample(n=min(n_items, len(df)), random_state=random_state),
        'head': lambda df: df.nlargest(min(n_items, len(df)), value_column),
        'tail': lambda df: df.nsmallest(min(n_items, len(df)), value_column)}

    sampling_messages = {
        'sample': 'Random',
        'head': 'Top',
        'tail': 'Bottom'}

    # setting default color pallet
    if color_palette is None:
        color_palette = px.colors.qualitative.Pastel

    # creating a color mapping if consistent_colors is True
    color_mapping = None
    if consistent_colors:
        all_parameters = df[parameter_column].unique()
        color_mapping = {
            param: color_palette[i % len(color_palette)] # reusing colors from the palette if there are more parameters than colors
            for i, param in enumerate(all_parameters)}

    # grouping data by parameter
    df_grouped = df.groupby(parameter_column)[value_column].sum().reset_index()

    # applying sampling method
    selected_parameters = sampling_methods[sample_type](df_grouped)

    # applying sorting if needed (except for random sampling)
    if sample_type != 'sample':
        #selected_parameters = selected_parameters.sort_values(value_column, ascending=sort_ascending)
        selected_parameters = selected_parameters.sort_values(value_column, ascending=not sort_ascending) # reversing the sorting direction (without reversing, sort_ascending=True results in bigger bars at the top of a Totals plot, which is counterintuitive)

    # setting the subplot
    if plot_totals and plot_distribution:
        fig = make_subplots(
            rows=1, cols=2,
            subplot_titles=(f'<b>\"{value_column}\" Totals</b>', f'<b>\"{value_column}\" Distribution</b>'),
            horizontal_spacing=0.05)
    elif plot_totals:
        fig = make_subplots(rows=1, cols=1, subplot_titles=(f'<b>\"{value_column}\" Totals</b>',))
    elif plot_distribution:
        fig = make_subplots(rows=1, cols=1, subplot_titles=(f'<b>\"{value_column}\" Distribution</b>',))
    else:
        raise ValueError('At least one of `plot_totals` or `plot_distribution` must be True.')

    # plotting bar chart of totals (left subplot)
    if plot_totals:
        # determining the colors to use
        if consistent_colors:
            bar_colors = [color_mapping[param] for param in selected_parameters[parameter_column]]
        else:
            bar_colors = [color_palette[i % len(color_palette)] for i in range(len(selected_parameters))] # reusing colors from the palette if there are more parameters than colors

        fig.add_trace(
            go.Bar(
                x=selected_parameters[value_column],
                y=selected_parameters[parameter_column],
                orientation='h',
                text=[EngFormatter(places=1)(x) for x in selected_parameters[value_column]],
                textposition='inside',
                marker_color=bar_colors,
                showlegend=False),
            row=1, col=1 if plot_distribution else 1)

    # plotting box plot chart of the distribution (right subplot)
    if plot_distribution:
        selected_parameters_list = selected_parameters[parameter_column].tolist()

        for parameter_id, parameter_value in enumerate(selected_parameters_list):
            parameter_data = df[df[parameter_column] == parameter_value]

            # determining outliers and bounds for future boxplots
            if not show_outliers:
                q1 = parameter_data[value_column].quantile(0.25)
                q3 = parameter_data[value_column].quantile(0.75)
                iqr = q3 - q1

                parameter_data = parameter_data[
                    (parameter_data[value_column] >= q1 - 1.5 * iqr) &
                    (parameter_data[value_column] <= q3 + 1.5 * iqr)]

            # determining the colors to use
            if consistent_colors:
                box_color = color_mapping[parameter_value]
            else:
                box_color = color_palette[parameter_id % len(color_palette)] # reusing colors from the palette if there are more parameters than colors

            # adding a box plot for this item
            fig.add_trace(
                go.Box(
                    x=parameter_data[value_column],
                    y=[parameter_value] * len(parameter_data),
                    name=parameter_value,
                    orientation='h',
                    showlegend=False,
                    marker_color=box_color,
                    boxpoints='outliers' if show_outliers else False),
                row=1, col=2 if plot_totals else 1)

    # adjusting the appearance
    sampling_message = f'{sampling_messages[sample_type]} {n_items}'

    if title_start:
        title_start = sampling_message
    else:
        title_start = ''

    title_text = f'<b>{title_start} \"{value_column}\" by \"{parameter_column}\"{" " + title_extension if title_extension else ""}: {"Totals and Distribution" if plot_totals and plot_distribution else "Totals" if plot_totals else "Distribution"}</b>'

    fig.update_layout(
        height=fig_height,
        width=fig_width,
        title={
            'text': title_text,
            'font_size': 19, 'y': 0.95, 'x': 0.5})

    if plot_totals:
        fig.update_xaxes(title_text=value_column, row=1, col=1)
    if plot_distribution:
        fig.update_xaxes(title_text=value_column, title_font=dict(size=14), row=1, col=2 if plot_totals else 1)
    fig.update_yaxes(title_text=parameter_column, title_font=dict(size=14), row=1, col=1)
    if plot_totals:
        fig.update_yaxes(title_text='', showticklabels=False, row=1, col=2)
    else:
        fig.update_yaxes(title_text=parameter_column, row=1, col=1)

    return fig.show()
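A hedged usage sketch of the function above; the column names come from the df_ecom tables shown later in this section, while the exact calls used in the project may differ:

```python
plot_totals_distribution(df_ecom, parameter_column='description', value_column='quantity',
                         n_items=10, sample_type='head', consistent_colors=True)
```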
# checking outliers with IQR approach + descriptive statistics
distribution_IQR(df=df_ecom, parameter='quantity', title_extension='', x_limits=[-20, 60], bins=[500, 2000], speed_up_plotting=True, outliers_info=True)
Note: A sample data slice 2% of "df_ecom" was used for histogram plotting instead of the full DataFrame.
This significantly reduced plotting time for the large dataset. The accuracy of the visualization might be slightly reduced, meanwhile it should be sufficient for exploratory analysis.
==================================================
Statistics on quantity in df_ecom
count 535185.00
mean 9.67
std 219.06
min -80995.00
25% 1.00
50% 3.00
75% 10.00
max 80995.00
Name: quantity, dtype: float64
--------------------------------------------------
The distribution is slightly skewed to the left
(skewness: -0.3)
Note: outliers affect skewness calculation
--------------------------------------------------
Min border: -13
Max border: 24
--------------------------------------------------
The outliers are considered to be values above 24
We have 32411 values that we can consider outliers
Which makes 6.1% of the total "quantity" data
==================================================
# let's check descriptive statistics of quantity by product
products_quantity_ranges = df_ecom.groupby('stock_code')['quantity']
#products_quantity_var = products_quantity_ranges.var().mean()
#products_quantity_std = products_quantity_ranges.std().mean()
products_quantity_cov = products_quantity_ranges.apply(
    lambda x: (x.std() / x.mean() * 100) if x.mean() != 0 else 0)\
    .mean()
#print(f'\033[1mAverage variation of a stock code quantity:\033[0m {products_quantity_var:.0f}')
#print(f'\033[1mAverage standard variation of a stock code quantity:\033[0m {products_quantity_std:.0f}')
print(f'\033[1mAverage coefficient of variation of quantity across stock codes:\033[0m {products_quantity_cov:.1f}%')
Average coefficient of variation of quantity across stock codes: 235.9%
Let’s examine outliers through a percentile methodology.
⚠ Note: Here and throughout the project, we will use a percentile methodology with relatively broad boundaries (3rd and 97th percentiles) to examine outliers, in addition to the IQR approach, as our goal is to balance outlier detection with data integrity, ensuring potentially valuable information isn't lost.
# checking outliers with the percentile approach
percentile_outliers(df=df_ecom, parameter='quantity', lower_percentile=3, upper_percentile=97, print_limits=True, frame_len=85)
==============================================================================================================
Data on quantity outliers based on the “percentile approach”
The outliers are considered to be values below 1.0 and above 48.0
We have 22881 values that we can consider outliers
Which makes 4.3% of the total "quantity" data
--------------------------------------------------------------------------------------------------------------
Limits: {'df_ecom_quantity_lower_limit': 1.0, 'df_ecom_quantity_upper_limit': 48.0}
==============================================================================================================
# checking the share of outliers above the upper percentile according to quantity amounts
top_quantity_df = df_ecom.query('quantity > @df_ecom_quantity_upper_limit')

share_evaluation(top_quantity_df, df_ecom, show_qty_rev=True,
                 show_pie_charts=True, show_pie_charts_notes=True,
                 show_boxplots=True)
======================================================================================================================================================
Evaluation of share: top_quantity_df
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 13156 (2.5% of all entries)
Quantity: 2112240 (40.8% of the total quantity)
Revenue: 3001138.6 (30.8% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
[Pie charts: the share of entries of df_ecom that falls into top_quantity_df, the share of revenue of df_ecom that is generated in top_quantity_df, and the share of quantity of df_ecom that occurs in top_quantity_df. Every entry is counted separately, even if entries are associated with the same order.]
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
# checking the share of outliers below the lower percentile according to quantity amounts
lower_quantity_outliers = df_ecom.query('quantity < @df_ecom_quantity_lower_limit')

share_evaluation(lower_quantity_outliers, df_ecom, show_qty_rev=True,
                 show_pie_charts=True, show_pie_charts_notes=True,
                 show_boxplots=True)
======================================================================================================================================================
Evaluation of share: lower_quantity_outliers
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 9725 (1.8% of all entries)
Quantity: -436361 (8.4% of the total quantity)
Revenue: -893979.7 (9.2% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
[Pie charts: the share of entries of df_ecom that falls into lower_quantity_outliers, the share of revenue of df_ecom that is generated in lower_quantity_outliers, and the share of quantity of df_ecom that occurs in lower_quantity_outliers. Every entry is counted separately, even if entries are associated with the same order.]
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
# checking the zero quantity entries
len(df_ecom.query('quantity == 0'))
0
# checking the most visually obvious outliers with positive quantity
share_evaluation(df_ecom.query('quantity > 20000'), df_ecom,
                 show_qty_rev=True,
                 show_example=True, example_type='sample', example_limit=3)
======================================================================================================================================================
Evaluation of share: the data slice mentioned in the call function
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 2 (0.0% of all entries)
Quantity: 155210 (3.0% of the total quantity)
Revenue: 245653.2 (2.5% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
540421 581483 23843 PAPER CRAFT , LITTLE BIRDIE 80995 2019-12-07 09:15:00 2.08 16446 2019 12
61619 541431 23166 MEDIUM CERAMIC TOP STORAGE JAR 74215 2019-01-16 10:01:00 1.04 12346 2019 1
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
540421 2019-12 49 2019-Week-49 2019-12-07 5 Saturday 168469.60
61619 2019-01 3 2019-Week-03 2019-01-16 2 Wednesday 77183.60
======================================================================================================================================================
# checking the most visually obvious outliers with negative quantity
share_evaluation(df_ecom.query('quantity < -20000'), df_ecom, show_qty_rev=True,
                 show_example=True, example_type='sample', example_limit=3)
======================================================================================================================================================
Evaluation of share: the data slice mentioned in the call function
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 2 (0.0% of all entries)
Quantity: -155210 (3.0% of the total quantity)
Revenue: -245653.2 (2.5% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
61624 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 2019-01-16 10:17:00 1.04 12346 2019 1
540422 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2019-12-07 09:27:00 2.08 16446 2019 12
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
61624 2019-01 3 2019-Week-03 2019-01-16 2 Wednesday -77183.60
540422 2019-12 49 2019-Week-49 2019-12-07 5 Saturday -168469.60
======================================================================================================================================================
# checking the most visually obvious outliers altogether
share_evaluation(df_ecom.query('quantity > 20000 or quantity < -20000'), df_ecom, show_qty_rev=True,
                 show_example=True, example_type='sample', example_limit=3, frame_len=100)
====================================================================================================
Evaluation of share: the data slice mentioned in the call function
in df_ecom
----------------------------------------------------------------------------------------------------
Number of entries: 4 (0.0% of all entries)
Quantity: 0 (0.0% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
----------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
61619 541431 23166 MEDIUM CERAMIC TOP STORAGE JAR 74215 2019-01-16 10:01:00 1.04 12346 2019 1
540422 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2019-12-07 09:27:00 2.08 16446 2019 12
61624 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 2019-01-16 10:17:00 1.04 12346 2019 1
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
61619 2019-01 3 2019-Week-03 2019-01-16 2 Wednesday 77183.60
540422 2019-12 49 2019-Week-49 2019-12-07 5 Saturday -168469.60
61624 2019-01 3 2019-Week-03 2019-01-16 2 Wednesday -77183.60
====================================================================================================
Observations
The quantity mean (9.7) is over 3 times higher than the median (3.0), and the distribution is skewed to the right.
There is a local peak at about 20-25 items per invoice.
Significant share of outliers: 6.1% according to the “IQR approach” (not taking negative values into account) and 4.3% according to the “percentile approach” (with soft thresholds of 3rd and 97th percentiles, including negative values).
Outliers represent a minor share of all entries but account for a significant portion of quantity and revenue:
There are mutually exclusive entries where a client bought and then returned the same product (same customer id, stock code, unit price, and quantity, represented by both positive and negative values). Just the two most obvious cases, which are considered outliers, account for entries worth 3% of the total quantity and 2.5% of the total revenue.
At least some entries with negative quantity values have an invoice_no starting with the letter “C”, which may stand for “canceled” or “corrected”, indicating returns or corrections of mistakes made during order placement.
Decisions
Keep most outliers with high quantities sold: since they contribute significantly to both quantity and revenue, they are essential for the further Product Range Analysis.
Investigate and address entries with negative quantities and the mutually exclusive entries that intersect with them. Study the two most obvious outliers more precisely; if there is a high likelihood that they are due to mistakes rather than true returns, remove the corresponding entries, as they may seriously affect further analysis.
Investigate and address invoice_no values starting with the letter “C” and potentially other “special” identifiers.
Study wholesale purchases, as their impact seems significant.
Sales entries where a customer bought and then returned the same product can distort our further analyses. We will identify and study such operations. Based on the findings, mainly the scope of such operations, we will decide whether to keep them in or exclude them from the main dataset for further analyses.
We will analyze returns more precisely later on to define the most returned products; at this stage of the study we are pursuing data investigation and cleaning objectives.
# calculating sales and negative quantities entries separately
sales_df = df_ecom.query('quantity > 0').copy()
negative_qty_df = df_ecom.query('quantity < 0').copy()

display(sales_df.sample(3, random_state=3))
negative_qty_df.sample(3, random_state=10)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
196081 | 553768 | 22668 | PINK BABY BUNTING | 2 | 2019-05-17 10:47:00 | 5.79 | 0 | 2019 | 5 | 2019-05 | 20 | 2019-Week-20 | 2019-05-17 | 4 | Friday | 11.58 |
299473 | 563100 | 22955 | 36 FOIL STAR CAKE CASES | 6 | 2019-08-10 09:57:00 | 2.10 | 12381 | 2019 | 8 | 2019-08 | 32 | 2019-Week-32 | 2019-08-10 | 5 | Saturday | 12.60 |
100296 | 544812 | 90104 | PURPLE FRANGIPANI HAIRCLIP | 1 | 2019-02-21 15:58:00 | 0.82 | 0 | 2019 | 2 | 2019-02 | 8 | 2019-Week-08 | 2019-02-21 | 3 | Thursday | 0.82 |
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61958 | C541492 | 85040A | S/4 PINK FLOWER CANDLES IN BOWL | -1 | 2019-01-16 14:24:00 | 1.65 | 0 | 2019 | 1 | 2019-01 | 3 | 2019-Week-03 | 2019-01-16 | 2 | Wednesday | -1.65 |
479867 | C577227 | D | Discount | -1 | 2019-11-16 12:06:00 | 14.88 | 14527 | 2019 | 11 | 2019-11 | 46 | 2019-Week-46 | 2019-11-16 | 5 | Saturday | -14.88 |
467819 | 576367 | 23071 | damages | -65 | 2019-11-12 18:31:00 | 0.00 | 0 | 2019 | 11 | 2019-11 | 46 | 2019-Week-46 | 2019-11-12 | 1 | Tuesday | -0.00 |
“sales_df”* and “negative_qty_df” are categorized based on positive and negative quantities respectively. “negative_qty_df” corresponds to returns of purchases and service entries, such as manual adjustments, discounts, and others.
In the next step we will identify the indexes of sales (entries with positive quantities) and of negative quantity entries. Then we will merge the DataFrames on customer_id, stock_code, unit_price, and quantity_abs to extract mutually exclusive entries - those where customers both purchased and returned the same quantity of the same products at the same price (a toy sketch of this matching logic follows below).
We should note that this approach doesn't cover some possible cases:
- where a customer returned a different amount of the same previously purchased product;
- where the price of the same returned product was different;
- where the return was processed without the proper stock code, e.g. via a manual correction code.
*Note: As we’ve already identified, there are no zero quantity entries, thus the negative_qty_df DataFrame is in fact identical to the lower_quantity_outliers DataFrame that we’ve studied above.
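A toy sketch of this matching logic with hypothetical values (not from the real dataset): only the +6 / -6 pair for the same customer, stock code, and price survives the inner merge.

```python
import pandas as pd

sales = pd.DataFrame({'customer_id': [1, 1], 'stock_code': ['A', 'B'],
                      'unit_price': [2.0, 5.0], 'quantity': [6, 3]})
returns = pd.DataFrame({'customer_id': [1], 'stock_code': ['A'],
                        'unit_price': [2.0], 'quantity': [-6]})

sales['quantity_abs'] = sales['quantity']
returns['quantity_abs'] = returns['quantity'].abs()

matched = pd.merge(sales, returns, how='inner',
                   on=['customer_id', 'stock_code', 'unit_price', 'quantity_abs'],
                   suffixes=('_sales', '_returns'))
print(matched[['customer_id', 'stock_code', 'quantity_sales', 'quantity_returns']])
```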
# checking the share of all entries with negative quantity
share_evaluation(negative_qty_df, df_ecom, show_qty_rev=True, show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: negative_qty_df
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 9725 (1.8% of all entries)
Quantity: -436361 (8.4% of the total quantity)
Revenue: -893979.7 (9.2% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
242595 C558361 23168 CLASSIC CAFE SUGAR DISPENSER -3 2019-06-26 15:13:00 1.25 15128 2019 6
310894 C564217 22666 RECIPE BOX PANTRY YELLOW DESIGN -2 2019-08-22 09:24:00 2.95 12994 2019 8
203837 C554558 22892 SET OF SALT AND PEPPER TOADSTOOLS -1 2019-05-23 10:24:00 1.25 13268 2019 5
74992 C542537 22892 SET OF SALT AND PEPPER TOADSTOOLS -3 2019-01-26 13:54:00 1.25 12501 2019 1
127053 C547187 37448 CERAMIC CAKE DESIGN SPOTTED MUG -6 2019-03-19 12:20:00 1.49 12779 2019 3
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
242595 2019-06 26 2019-Week-26 2019-06-26 2 Wednesday -3.75
310894 2019-08 34 2019-Week-34 2019-08-22 3 Thursday -5.90
203837 2019-05 21 2019-Week-21 2019-05-23 3 Thursday -1.25
74992 2019-01 4 2019-Week-04 2019-01-26 5 Saturday -3.75
127053 2019-03 12 2019-Week-12 2019-03-19 1 Tuesday -8.94
======================================================================================================================================================
# creating absolute quantity columns
sales_df['quantity_abs'] = sales_df['quantity']
negative_qty_df['quantity_abs'] = negative_qty_df['quantity'].abs()

# adding identifiers (for merging purposes)
sales_df['id'] = sales_df.index
negative_qty_df['id'] = negative_qty_df.index

# merging sales and returns on "customer_id", "stock_code", "unit_price", and "quantity_abs"
df_sales_returns = pd.merge(sales_df, negative_qty_df, how='inner',
                            on=['customer_id', 'stock_code', 'unit_price', 'quantity_abs'],
                            suffixes=('_sales', '_returns'))

df_sales_returns.head(3)
invoice_no_sales | stock_code | description_sales | quantity_sales | invoice_date_sales | unit_price | customer_id | invoice_year_sales | invoice_month_sales | invoice_year_month_sales | invoice_week_sales | invoice_year_week_sales | invoice_day_sales | invoice_day_of_week_sales | invoice_day_name_sales | revenue_sales | quantity_abs | id_sales | invoice_no_returns | description_returns | quantity_returns | invoice_date_returns | invoice_year_returns | invoice_month_returns | invoice_year_month_returns | invoice_week_returns | invoice_year_week_returns | invoice_day_returns | invoice_day_of_week_returns | invoice_day_name_returns | revenue_returns | id_returns | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2018-11-29 08:28:00 | 1.85 | 17850 | 2018 | 11 | 2018-11 | 48 | 2018-Week-48 | 2018-11-29 | 3 | Thursday | 11.10 | 6 | 8 | C543611 | HAND WARMER RED RETROSPOT | -6 | 2019-02-08 14:38:00 | 2019 | 2 | 2019-02 | 6 | 2019-Week-06 | 2019-02-08 | 4 | Friday | -11.10 | 86889 |
1 | 536372 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2018-11-29 09:01:00 | 1.85 | 17850 | 2018 | 11 | 2018-11 | 48 | 2018-Week-48 | 2018-11-29 | 3 | Thursday | 11.10 | 6 | 47 | C543611 | HAND WARMER RED RETROSPOT | -6 | 2019-02-08 14:38:00 | 2019 | 2 | 2019-02 | 6 | 2019-Week-06 | 2019-02-08 | 4 | Friday | -11.10 | 86889 |
2 | 536373 | 21071 | VINTAGE BILLBOARD DRINK ME MUG | 6 | 2018-11-29 09:02:00 | 1.06 | 17850 | 2018 | 11 | 2018-11 | 48 | 2018-Week-48 | 2018-11-29 | 3 | Thursday | 6.36 | 6 | 55 | C543611 | VINTAGE BILLBOARD DRINK ME MUG | -6 | 2019-02-08 14:38:00 | 2019 | 2 | 2019-02 | 6 | 2019-Week-06 | 2019-02-08 | 4 | Friday | -6.36 | 86896 |
# checking possible duplicates
df_sales_returns_duplicated = df_sales_returns.duplicated(subset=['customer_id', 'stock_code', 'unit_price', 'quantity_abs'])

print('=' * table_width)
print(f'\033[1mNumber of duplicates:\033[0m {df_sales_returns_duplicated.sum()}\n')
print('\033[1mExamples of duplicates:\033[0m')
display(df_sales_returns[df_sales_returns_duplicated].head(3))
print('=' * table_width)
======================================================================================================================================================
Number of duplicates: 2782
Examples of duplicates:
invoice_no_sales | stock_code | description_sales | quantity_sales | invoice_date_sales | unit_price | customer_id | invoice_year_sales | invoice_month_sales | invoice_year_month_sales | invoice_week_sales | invoice_year_week_sales | invoice_day_sales | invoice_day_of_week_sales | invoice_day_name_sales | revenue_sales | quantity_abs | id_sales | invoice_no_returns | description_returns | quantity_returns | invoice_date_returns | invoice_year_returns | invoice_month_returns | invoice_year_month_returns | invoice_week_returns | invoice_year_week_returns | invoice_day_returns | invoice_day_of_week_returns | invoice_day_name_returns | revenue_returns | id_returns | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 536372 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2018-11-29 09:01:00 | 1.85 | 17850 | 2018 | 11 | 2018-11 | 48 | 2018-Week-48 | 2018-11-29 | 3 | Thursday | 11.10 | 6 | 47 | C543611 | HAND WARMER RED RETROSPOT | -6 | 2019-02-08 14:38:00 | 2019 | 2 | 2019-02 | 6 | 2019-Week-06 | 2019-02-08 | 4 | Friday | -11.10 | 86889 |
4 | 536375 | 21071 | VINTAGE BILLBOARD DRINK ME MUG | 6 | 2018-11-29 09:32:00 | 1.06 | 17850 | 2018 | 11 | 2018-11 | 48 | 2018-Week-48 | 2018-11-29 | 3 | Thursday | 6.36 | 6 | 72 | C543611 | VINTAGE BILLBOARD DRINK ME MUG | -6 | 2019-02-08 14:38:00 | 2019 | 2 | 2019-02 | 6 | 2019-Week-06 | 2019-02-08 | 4 | Friday | -6.36 | 86896 |
5 | 536375 | 82483 | WOOD 2 DRAWER CABINET WHITE FINISH | 2 | 2018-11-29 09:32:00 | 4.95 | 17850 | 2018 | 11 | 2018-11 | 48 | 2018-Week-48 | 2018-11-29 | 3 | Thursday | 9.90 | 2 | 74 | C543611 | WOOD 2 DRAWER CABINET WHITE FINISH | -2 | 2019-02-08 14:38:00 | 2019 | 2 | 2019-02 | 6 | 2019-Week-06 | 2019-02-08 | 4 | Friday | -9.90 | 86897 |
======================================================================================================================================================
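These duplicates are expected: the merge is many-to-many on the four keys, so when a customer bought the same product at the same price and in the same quantity on several invoices but returned it only once, the single return row matches each of those sales (e.g., the HAND WARMER rows in the tables above share id_returns 86889). A toy sketch with hypothetical values:

```python
import pandas as pd

sales = pd.DataFrame({'customer_id': [1, 1], 'stock_code': ['A', 'A'],
                      'unit_price': [2.0, 2.0], 'quantity_abs': [6, 6],
                      'invoice_no': ['536366', '536372']})
returns = pd.DataFrame({'customer_id': [1], 'stock_code': ['A'],
                        'unit_price': [2.0], 'quantity_abs': [6],
                        'invoice_no': ['C543611']})

merged = pd.merge(sales, returns, how='inner',
                  on=['customer_id', 'stock_code', 'unit_price', 'quantity_abs'],
                  suffixes=('_sales', '_returns'))
print(len(merged))                                   # 2 rows: one return matched two sales
print(merged.duplicated(subset=['customer_id', 'stock_code',
                                'unit_price', 'quantity_abs']).sum())  # 1 duplicate
```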
# cleaning out the duplicates
df_sales_returns_cleaned = df_sales_returns.drop_duplicates(subset=['customer_id', 'stock_code', 'unit_price', 'quantity_abs'])

# checking the result
df_sales_returns_cleaned.duplicated(subset=['customer_id', 'stock_code', 'unit_price', 'quantity_abs']).sum()
0
# extracting ids of mutually exclusive entries
sales_excl_ids = df_sales_returns_cleaned['id_sales']
returns_excl_ids = df_sales_returns_cleaned['id_returns']
sales_returns_excl_ids = pd.concat([sales_excl_ids, returns_excl_ids])
print('=' * 38)
print('\033[1mNumber of Sales IDs:\033[0m', len(sales_excl_ids))
print('\033[1mNumber of Returns IDs:\033[0m',len(returns_excl_ids))
print('\033[1mNumber of Sales and Returns IDs:\033[0m', len(sales_returns_excl_ids))
print('=' * 38)
======================================
Number of Sales IDs: 3139
Number of Returns IDs: 3139
Number of Sales and Returns IDs: 6278
======================================
# identifying mutually exclusive entries
sales_excl = df_ecom.loc[sales_excl_ids]
returns_excl = df_ecom.loc[returns_excl_ids]
sales_returns_excl = df_ecom.loc[sales_returns_excl_ids]

sales_returns_excl.sample(3)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
86862 | C543606 | 22847 | BREAD BIN DINER STYLE IVORY | -1 | 2019-02-08 14:13:00 | 16.95 | 14665 | 2019 | 2 | 2019-02 | 6 | 2019-Week-06 | 2019-02-08 | 4 | Friday | -16.95 |
77597 | C542742 | 22821 | GIFT BAG PSYCHEDELIC APPLES | -12 | 2019-01-29 16:26:00 | 0.65 | 15358 | 2019 | 1 | 2019-01 | 5 | 2019-Week-05 | 2019-01-29 | 1 | Tuesday | -7.80 |
64407 | 541604 | 22423 | REGENCY CAKESTAND 3 TIER | 1 | 2019-01-17 17:23:00 | 12.75 | 14572 | 2019 | 1 | 2019-01 | 3 | 2019-Week-03 | 2019-01-17 | 3 | Thursday | 12.75 |
# checking the share of sales from mutually exclusive entries
share_evaluation(sales_excl, df_ecom, show_qty_rev=True, frame_len=45)
=============================================
Evaluation of share: sales_excl
in df_ecom
---------------------------------------------
Number of entries: 3139 (0.6% of all entries)
Quantity: 228936 (4.4% of the total quantity)
Revenue: 454347.9 (4.7% of the total revenue)
=============================================
# checking the share of returns from mutually exclusive entries
share_evaluation(returns_excl, df_ecom, show_qty_rev=True, frame_len=45)
=============================================
Evaluation of share: returns_excl
in df_ecom
---------------------------------------------
Number of entries: 3139 (0.6% of all entries)
Quantity: -228936 (4.4% of the total quantity)
Revenue: -454347.9 (4.7% of the total revenue)
=============================================
# checking the share of mutually exclusive sales and returns
share_evaluation(sales_returns_excl, df_ecom, show_qty_rev=True,
                 show_boxplots=True)
======================================================================================================================================================
Evaluation of share: sales_returns_excl
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 6278 (1.2% of all entries)
Quantity: 0 (0.0% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
Let’s study the two most obvious outliers. We can also observe revenue outliers; we will study them in the next stage of the Distribution Analysis (in fact, those outliers can be interconnected).
df_ecom.query('quantity > 20000 or quantity < -20000')
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61619 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 74215 | 2019-01-16 10:01:00 | 1.04 | 12346 | 2019 | 1 | 2019-01 | 3 | 2019-Week-03 | 2019-01-16 | 2 | Wednesday | 77183.60 |
61624 | C541433 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -74215 | 2019-01-16 10:17:00 | 1.04 | 12346 | 2019 | 1 | 2019-01 | 3 | 2019-Week-03 | 2019-01-16 | 2 | Wednesday | -77183.60 |
540421 | 581483 | 23843 | PAPER CRAFT , LITTLE BIRDIE | 80995 | 2019-12-07 09:15:00 | 2.08 | 16446 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | 168469.60 |
540422 | C581484 | 23843 | PAPER CRAFT , LITTLE BIRDIE | -80995 | 2019-12-07 09:27:00 | 2.08 | 16446 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | -168469.60 |
'stock_code == "23166"') df_ecom.query(
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61619 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 74215 | 2019-01-16 10:01:00 | 1.04 | 12346 | 2019 | 1 | 2019-01 | 3 | 2019-Week-03 | 2019-01-16 | 2 | Wednesday | 77183.60 |
61624 | C541433 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -74215 | 2019-01-16 10:17:00 | 1.04 | 12346 | 2019 | 1 | 2019-01 | 3 | 2019-Week-03 | 2019-01-16 | 2 | Wednesday | -77183.60 |
186770 | 552882 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 96 | 2019-05-10 10:10:00 | 1.04 | 14646 | 2019 | 5 | 2019-05 | 19 | 2019-Week-19 | 2019-05-10 | 4 | Friday | 99.84 |
187196 | 552953 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 4 | 2019-05-10 12:11:00 | 1.25 | 16745 | 2019 | 5 | 2019-05 | 19 | 2019-Week-19 | 2019-05-10 | 4 | Friday | 5.00 |
187718 | 553005 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 5 | 2019-05-10 16:29:00 | 1.25 | 14651 | 2019 | 5 | 2019-05 | 19 | 2019-Week-19 | 2019-05-10 | 4 | Friday | 6.25 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
533742 | 581108 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 2 | 2019-12-05 12:16:00 | 1.25 | 15984 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-05 | 3 | Thursday | 2.50 |
536248 | 581219 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 1 | 2019-12-06 09:28:00 | 2.46 | 0 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-06 | 4 | Friday | 2.46 |
539776 | 581439 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 2 | 2019-12-06 16:30:00 | 2.46 | 0 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-06 | 4 | Friday | 4.92 |
540301 | 581476 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 48 | 2019-12-07 08:48:00 | 1.04 | 12433 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | 49.92 |
541101 | 581492 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 2 | 2019-12-07 10:03:00 | 2.46 | 0 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | 4.92 |
260 rows × 16 columns
'stock_code == "23843"') df_ecom.query(
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
540421 | 581483 | 23843 | PAPER CRAFT , LITTLE BIRDIE | 80995 | 2019-12-07 09:15:00 | 2.08 | 16446 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | 168469.60 |
540422 | C581484 | 23843 | PAPER CRAFT , LITTLE BIRDIE | -80995 | 2019-12-07 09:27:00 | 2.08 | 16446 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | -168469.60 |
Observations
Decisions
# creating a DataFrame, displaying number of invoices per each stock code in the original DataFrame
df_ecom_stock_codes_number = df_ecom.groupby('stock_code')['invoice_no'].count().reset_index()
df_ecom_stock_codes_number.head(2)
stock_code | invoice_no | |
---|---|---|
0 | 10002 | 71 |
1 | 10080 | 23 |
# creating a DataFrame, displaying number of invoices per each stock code within the DataFrame of mutually exclusive entries
sales_returns_excl_stock_codes_number = sales_returns_excl.groupby('stock_code')['invoice_no'].count().reset_index()
sales_returns_excl_stock_codes_number.head(2)
stock_code | invoice_no | |
---|---|---|
0 | 10133 | 2 |
1 | 15034 | 4 |
# merging DataFrames
stock_codes_number_merged = (
    df_ecom_stock_codes_number.merge(sales_returns_excl_stock_codes_number,
                                     how="inner",
                                     on='stock_code',
                                     suffixes=('_df_ecom', '_meo')))
stock_codes_number_merged
stock_code | invoice_no_df_ecom | invoice_no_meo | |
---|---|---|---|
0 | 10133 | 198 | 2 |
1 | 15034 | 142 | 4 |
2 | 15036 | 523 | 4 |
3 | 15039 | 148 | 2 |
4 | 15056BL | 326 | 6 |
... | ... | ... | ... |
1382 | C2 | 143 | 4 |
1383 | DOT | 709 | 2 |
1384 | M | 566 | 94 |
1385 | POST | 1252 | 44 |
1386 | S | 62 | 2 |
1387 rows × 3 columns
# checking the stock codes that have equal number of invoices in the original DataFrame and in the mutually exclusive entries DataFrame
stock_codes_outliers = stock_codes_number_merged.query('invoice_no_df_ecom == invoice_no_meo')
stock_codes_outliers

stock_codes_outliers_list = stock_codes_outliers['stock_code'].to_list()
stock_codes_outliers_list

df_ecom.query('stock_code in @stock_codes_outliers_list')
stock_code | invoice_no_df_ecom | invoice_no_meo | |
---|---|---|---|
213 | 21667 | 2 | 2 |
1113 | 23595 | 2 | 2 |
1118 | 23843 | 2 | 2 |
['21667', '23595', '23843']
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
54363 | 540943 | 21667 | GLASS CAKE COVER AND PLATE | 2 | 2019-01-10 12:31:00 | 14.95 | 17841 | 2019 | 1 | 2019-01 | 2 | 2019-Week-02 | 2019-01-10 | 3 | Thursday | 29.90 |
58588 | C541254 | 21667 | GLASS CAKE COVER AND PLATE | -2 | 2019-01-14 13:53:00 | 14.95 | 17841 | 2019 | 1 | 2019-01 | 3 | 2019-Week-03 | 2019-01-14 | 0 | Monday | -29.90 |
417107 | 572614 | 23595 | adjustment | 5 | 2019-10-23 11:38:00 | 0.00 | 0 | 2019 | 10 | 2019-10 | 43 | 2019-Week-43 | 2019-10-23 | 2 | Wednesday | 0.00 |
417108 | 572615 | 23595 | re-adjustment | -5 | 2019-10-23 11:39:00 | 0.00 | 0 | 2019 | 10 | 2019-10 | 43 | 2019-Week-43 | 2019-10-23 | 2 | Wednesday | -0.00 |
540421 | 581483 | 23843 | PAPER CRAFT , LITTLE BIRDIE | 80995 | 2019-12-07 09:15:00 | 2.08 | 16446 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | 168469.60 |
540422 | C581484 | 23843 | PAPER CRAFT , LITTLE BIRDIE | -80995 | 2019-12-07 09:27:00 | 2.08 | 16446 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-07 | 5 | Saturday | -168469.60 |
Observations
Entries with negative quantity account for 1.8% of all entries, 8.4% of the total quantity, and 9.2% of the total revenue.
1.2% of all entries are mutually exclusive; half of them carry positive quantity and revenue values and half negative, so their sum is zero.
Returns (defined as the negative part of mutually exclusive entries) represent 0.6% of all entries, 4.4% of the total quantity, and 4.7% of the total revenue.
Entries with negative quantity and returns are intersecting sets, where returns form a smaller subset. The difference between them can be explained by discounts, manual corrections, and extra fees and charges from marketplaces and banks not covered by return entries.
It’s important to note that mutually exclusive entries may exist for both actual returned products and errors in order placement corrected by such operations. It’s extremely difficult or sometimes even impossible to distinguish between these cases.
Meanwhile, there are three stock codes represented by mutually exclusive pairs only. One of them is “23843”, which we have already seen; its extreme quantity entries suggest a mistake during order processing. The two other stock codes represent a negligible volume of goods and probably indicate mistakes made when placing orders.
Several outstanding outliers were revealed in the quantity distribution (and, accordingly, in revenue), represented by two pairs of mutually exclusive entries. Two of these entries refer to the “23843” stock code that we studied above.
Decisions
Handling mutually exclusive entries
We consider two possible approaches:
⚠ Final decision: For further product range analysis, we will retain sales data from mutually exclusive entries (the positive quantity entries) and remove only returns (the negative quantity entries from mutually exclusive entries). Thus we prioritize keeping sales data that might be valuable for our main goal of product range analysis. However, we will remove entries associated with extreme outliers and with stock codes represented by mutually exclusive pairs only.
Plan for mutually exclusive entries
Clean out returns and keep corresponding sales when defining the best and worst-performing products.
Study returns separately to identify products with higher return frequencies and amounts.
Combine both analyses (product performance and return rate) for a comprehensive view:
Poorly performing products with high return rates are best candidates for removal from the assortment.
Products bringing major revenue with minor return rates are candidates for promotion and higher inventory management priority.
Products bringing major revenue with significant return rates require further analysis to determine if return rates can be addressed (preferably before investing in promotion of those products).
Other entries with negative quantities
Implementation of Decisions
# filtering out returns (negative part of mutually exclusive entries) from the original dataset and assigning a new filtered DataFrame
operation = lambda df: df.drop(index=returns_excl_ids)
df_ecom_no_returns = data_reduction(df_ecom, operation)
Number of entries cleaned out from the "df_ecom": 3139 (0.6%)
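data_reduction is a helper defined earlier in the project; judging by its printed output, it applies the passed operation and reports how many entries were removed. A minimal sketch of that pattern (a reconstruction under that assumption, not the project's actual implementation):

```python
def data_reduction_sketch(df, operation):
    """Apply `operation` to `df` and report how many rows were dropped."""
    reduced = operation(df)
    removed = len(df) - len(reduced)
    print(f'Number of entries cleaned out: {removed} ({removed / len(df) * 100:.1f}%)')
    return reduced
```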
# cleaning out entries associated with main outliers that we consider mistakes in order placement
operation = lambda df: df.query('quantity < 20000 and quantity > -20000')
df_ecom_no_returns = data_reduction(df_ecom_no_returns, operation)
Number of entries cleaned out from the "df_ecom_no_returns": 2 (0.0%)
# cleaning out entries of stock codes represented only by mutually exclusive pairs
operation = lambda df: df.query('stock_code not in @stock_codes_outliers_list')
df_ecom_no_returns = data_reduction(df_ecom_no_returns, operation)
Number of entries cleaned out from the "df_ecom_no_returns": 2 (0.0%)
# checking the result
share_evaluation(df_ecom_no_returns, df_ecom, show_qty_rev=True, frame_len=50, show_pie_charts=True)
======================================================================================================================================================
Evaluation of share: df_ecom_no_returns
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 532042 (99.4% of all entries)
Quantity: 5249828 (101.4% of the total quantity)
Revenue: 9956795.9 (102.1% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Note: Unable to create pie chart as "quantity_sum" in the "df_ecom_no_returns" (5249828) exceeds the total "quantity_sum" (5176109) in the "df_ecom".
Note: Unable to create pie chart as "revenue_sum" in the "df_ecom_no_returns" (9956796) exceeds the total "revenue_sum" (9748131) in the "df_ecom".
======================================================================================================================================================
Note: The higher quantity and revenue after cleaning are expected, since we removed negative entries.
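As a rough sanity check, the new quantity total can be reconstructed from figures reported earlier: the original total, the removed returns, the two extreme positive entries, and the remaining +2 and +5 entries of the outlier stock codes:

```python
original_total  = 5_176_109         # total quantity of df_ecom
returns_removed = -228_936          # quantity of returns_excl
extreme_entries = 80_995 + 74_215   # remaining positive halves of the two extreme pairs
stock_code_rest = 2 + 5             # remaining positive entries of stock codes 21667 and 23595

print(original_total - returns_removed - extreme_entries - stock_code_rest)  # 5249828
```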
Service operations, such as manual corrections, discounts, etc. can affect our further analyses. We will identify and examine these entries and their share of the total. If they are not crucial for our study, we will exclude them from the main dataset.
It was previously noted that stock codes related to service operations consist of a single letter. Let's take a look at such stock codes.
# checking rows where the `stock_code` column consists of one letter
service_operations = df_ecom[df_ecom['stock_code'].str.len() == 1].reset_index()
service_operations_grouped = (service_operations.groupby('stock_code')['description'].value_counts()
                              .reset_index(name='count')
                              .sort_values(by='count', ascending=False))
service_operations_grouped

service_operations_descriptions = set(service_operations_grouped['description'])
service_operations_descriptions
stock_code | description | count | |
---|---|---|---|
2 | M | Manual | 566 |
1 | D | Discount | 77 |
3 | S | SAMPLES | 62 |
0 | B | Adjust bad debt | 1 |
4 | m | Manual | 1 |
{'Adjust bad debt', 'Discount', 'Manual', 'SAMPLES'}
# checking the share of service operations and their quantity and revenues by types
share_evaluation(service_operations, df_ecom, show_qty_rev=True,
                 show_boxplots=True, show_outliers=True, boxplots_parameter='description')
======================================================================================================================================================
Evaluation of share: service_operations
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 707 (0.1% of all entries)
Quantity: 1674 (0.0% of the total quantity)
Revenue: -66705.5 (0.7% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
# studying service operations
for description in service_operations_descriptions:
    df = df_ecom.query('description == @description')
    title_extension = f'service operation: "{description}"'

    share_evaluation(df, df_ecom, title_extension, show_qty_rev=True,
                     show_example=True, example_type='sample', example_limit=3)
    print('\n')
======================================================================================================================================================
Evaluation of share: df
service operation: “Manual” in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 567 (0.1% of all entries)
Quantity: 2925 (0.1% of the total quantity)
Revenue: -69031.6 (0.7% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month invoice_year_month \
66519 541808 M Manual 1 2019-01-19 14:51:00 10.00 16210 2019 1 2019-01
9575 537208 M Manual 4 2018-12-03 15:12:00 0.85 15889 2018 12 2018-12
333046 C566168 M Manual -1 2019-09-07 12:02:00 116.69 0 2019 9 2019-09
invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
66519 3 2019-Week-03 2019-01-19 5 Saturday 10.00
9575 49 2018-Week-49 2018-12-03 0 Monday 3.40
333046 36 2019-Week-36 2019-09-07 5 Saturday -116.69
======================================================================================================================================================
======================================================================================================================================================
Evaluation of share: df
service operation: “Adjust bad debt” in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 1 (0.0% of all entries)
Quantity: 1 (0.0% of the total quantity)
Revenue: 11062.1 (0.1% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month invoice_year_month \
299982 A563185 B Adjust bad debt 1 2019-08-10 14:50:00 11062.06 0 2019 8 2019-08
invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
299982 32 2019-Week-32 2019-08-10 5 Saturday 11062.06
======================================================================================================================================================
======================================================================================================================================================
Evaluation of share: df
service operation: “Discount” in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 77 (0.0% of all entries)
Quantity: -1194 (0.0% of the total quantity)
Revenue: -5696.2 (0.1% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month invoice_year_month \
317547 C564812 D Discount -1 2019-08-28 11:45:00 10.06 14527 2019 8 2019-08
280503 C561464 D Discount -1 2019-07-25 12:40:00 26.05 14527 2019 7 2019-07
479868 C577227 D Discount -1 2019-11-16 12:06:00 19.82 14527 2019 11 2019-11
invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
317547 35 2019-Week-35 2019-08-28 2 Wednesday -10.06
280503 30 2019-Week-30 2019-07-25 3 Thursday -26.05
479868 46 2019-Week-46 2019-11-16 5 Saturday -19.82
======================================================================================================================================================
======================================================================================================================================================
Evaluation of share: df
service operation: “SAMPLES” in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 62 (0.0% of all entries)
Quantity: -58 (0.0% of the total quantity)
Revenue: -3039.6 (0.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month invoice_year_month \
193449 C553531 S SAMPLES -1 2019-05-15 15:09:00 2.98 0 2019 5 2019-05
96699 C544581 S SAMPLES -1 2019-02-19 14:32:00 55.00 0 2019 2 2019-02
96689 C544580 S SAMPLES -1 2019-02-19 14:25:00 5.44 0 2019 2 2019-02
invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
193449 20 2019-Week-20 2019-05-15 2 Wednesday -2.98
96699 8 2019-Week-08 2019-02-19 1 Tuesday -55.00
96689 8 2019-Week-08 2019-02-19 1 Tuesday -5.44
======================================================================================================================================================
Of all the service operations listed above, manual operations have the biggest impact on revenue. Let's check the largest entries of that kind.
df_ecom.query('description == "Manual"').sort_values(by='revenue').head(3)
df_ecom.query('description == "Manual"').sort_values(by='revenue').tail(3)
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222681 | C556445 | M | Manual | -1 | 2019-06-08 15:31:00 | 38970.00 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | -38970.00 |
422375 | C573079 | M | Manual | -2 | 2019-10-25 14:15:00 | 4161.06 | 12536 | 2019 | 10 | 2019-10 | 43 | 2019-Week-43 | 2019-10-25 | 4 | Friday | -8322.12 |
173391 | C551699 | M | Manual | -1 | 2019-05-01 14:12:00 | 6930.00 | 16029 | 2019 | 5 | 2019-05 | 18 | 2019-Week-18 | 2019-05-01 | 2 | Wednesday | -6930.00 |
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
422351 | 573077 | M | Manual | 1 | 2019-10-25 14:13:00 | 4161.06 | 12536 | 2019 | 10 | 2019-10 | 43 | 2019-Week-43 | 2019-10-25 | 4 | Friday | 4161.06 |
422376 | 573080 | M | Manual | 1 | 2019-10-25 14:20:00 | 4161.06 | 12536 | 2019 | 10 | 2019-10 | 43 | 2019-Week-43 | 2019-10-25 | 4 | Friday | 4161.06 |
268028 | 560373 | M | Manual | 1 | 2019-07-16 12:30:00 | 4287.63 | 0 | 2019 | 7 | 2019-07 | 29 | 2019-Week-29 | 2019-07-16 | 1 | Tuesday | 4287.63 |
# checking entries of the customer with the most significant impact on revenue from manual corrections
df_ecom.query('customer_id == "15098"').sort_values(by='invoice_date')
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222670 | 556442 | 22502 | PICNIC BASKET WICKER SMALL | 60 | 2019-06-08 15:22:00 | 4.95 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 297.00 |
222680 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2019-06-08 15:28:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 38970.00 |
222681 | C556445 | M | Manual | -1 | 2019-06-08 15:31:00 | 38970.00 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | -38970.00 |
222682 | 556446 | 22502 | PICNIC BASKET WICKER 60 PIECES | 1 | 2019-06-08 15:33:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 649.50 |
222692 | C556448 | 22502 | PICNIC BASKET WICKER SMALL | -60 | 2019-06-08 15:39:00 | 4.95 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | -297.00 |
Let’s check sales, negative entries and mutually exclusive entries of the same customer.
sales_df.query('customer_id == "15098"')
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | quantity_abs | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222670 | 556442 | 22502 | PICNIC BASKET WICKER SMALL | 60 | 2019-06-08 15:22:00 | 4.95 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 297.00 | 60 | 222670 |
222680 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2019-06-08 15:28:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 38970.00 | 60 | 222680 |
222682 | 556446 | 22502 | PICNIC BASKET WICKER 60 PIECES | 1 | 2019-06-08 15:33:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 649.50 | 1 | 222682 |
negative_qty_df.query('customer_id == "15098"')
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | quantity_abs | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222681 | C556445 | M | Manual | -1 | 2019-06-08 15:31:00 | 38970.00 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | -38970.00 | 1 | 222681 |
222692 | C556448 | 22502 | PICNIC BASKET WICKER SMALL | -60 | 2019-06-08 15:39:00 | 4.95 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | -297.00 | 60 | 222692 |
sales_returns_excl.query('customer_id == "15098"')
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222670 | 556442 | 22502 | PICNIC BASKET WICKER SMALL | 60 | 2019-06-08 15:22:00 | 4.95 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 297.00 |
222692 | C556448 | 22502 | PICNIC BASKET WICKER SMALL | -60 | 2019-06-08 15:39:00 | 4.95 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | -297.00 |
Observations
Service operations include both positive and negative quantity and revenue values. In total, they account for just 0.1% of all entries, less than 0.1% of the total quantity, and 0.7% of the total revenue.
💡 The study reveals an important insight: returns and order placement corrections can be registered both with and without proper stock codes; some are recorded as Manual operations, which makes it difficult to differentiate between such cases.
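As an illustration of how such hidden corrections could be surfaced, here is a minimal sketch (not part of the original pipeline; the 0.01 tolerance is an arbitrary assumption) that pairs negative-revenue Manual entries with same-day purchases by the same customer for the same absolute amount, as in the customer 15098 example above.
# a minimal sketch (not part of the original pipeline): pairing Manual credits
# with a same-day purchase of the same customer for the same absolute amount
manual_credits = df_ecom.query('description == "Manual" and revenue < 0')

candidates = manual_credits.merge(df_ecom.query('revenue > 0'),
                                  on=['customer_id', 'invoice_day'],
                                  suffixes=('_credit', '_purchase'))

# keep pairs whose revenues cancel out (0.01 is an arbitrary tolerance)
offset_pairs = candidates.query('abs(revenue_purchase + revenue_credit) < 0.01')
offset_pairs[['customer_id', 'invoice_day', 'invoice_no_credit', 'revenue_credit',
              'invoice_no_purchase', 'stock_code_purchase', 'revenue_purchase']].head()
Such pairs are only candidates; a matching amount on the same day does not prove that one entry corrects the other.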
Decisions - Exclude the identified service operations (“Manual”, “Discount”, “SAMPLES”, “Adjust bad debt”) from the main dataset, since they do not represent product sales.
Implementation of Decisions
# filtering out service operations
operation = lambda df: df.query('description not in @service_operations_descriptions')
df_ecom_no_returns_no_operations = data_reduction(df_ecom_no_returns, operation)
Number of entries cleaned out from the "df_ecom_no_returns": 659 (0.1%)
# checking the result
share_evaluation(df_ecom_no_returns_no_operations, df_ecom_no_returns, show_qty_rev=True, frame_len=80)
================================================================================
Evaluation of share: df_ecom_no_returns_no_operations
in df_ecom_no_returns
--------------------------------------------------------------------------------
Number of entries: 531383 (99.9% of all entries)
Quantity: 5247959 (100.0% of the total quantity)
Revenue: 9986809.2 (100.3% of the total revenue)
================================================================================
Let’s extract the remaining operations, assuming they are represented by stock codes that contain no digits (unlike normal stock codes) and are longer than one character (unlike the basic service operations defined above). We came across such operations when studying data samples earlier, and there seems to be a pattern.
# defining the entries with negative quantity, excluding returns from mutually exclusive entries
negative_qty_no_returns = negative_qty_df.drop(index=returns_excl_ids)

# checking the nature of entries with negative quantity excluding returns from mutually exclusive entries
negative_qty_no_returns_by_stock_code = (negative_qty_no_returns.groupby(['stock_code'])
                                         .agg({'quantity':'sum', 'revenue':'sum'})
                                         .reset_index()
                                         .sort_values(by='revenue')
                                         )
negative_qty_no_returns_by_stock_code.head(10)
| stock_code | quantity | revenue |
---|---|---|---|
1647 | AMAZONFEE | -30 | -221520.50 |
1656 | M | -3872 | -110125.38 |
1649 | CRUK | -16 | -7933.43 |
1648 | BANK CHARGES | -25 | -7340.64 |
1650 | D | -1194 | -5696.22 |
607 | 22423 | -513 | -5186.40 |
1298 | 47566B | -2671 | -3490.60 |
1658 | S | -59 | -3069.65 |
1657 | POST | -111 | -2948.54 |
482 | 22191 | -332 | -2551.70 |
# defining a regex pattern to match stock codes without numbers and with more than one symbol
mask_regex = ~negative_qty_no_returns_by_stock_code['stock_code'].str.contains(r'[0-9]') & (negative_qty_no_returns_by_stock_code['stock_code'].str.len() > 1)
other_service_stock_codes = set(negative_qty_no_returns_by_stock_code[mask_regex]['stock_code'])
other_service_stock_codes
{'AMAZONFEE', 'BANK CHARGES', 'CRUK', 'POST'}
# checking the other service operations
other_service_operations = df_ecom.query('stock_code in @other_service_stock_codes')

share_evaluation(other_service_operations, df_ecom, show_qty_rev=True,
                 show_boxplots=True, boxplots_parameter='description',
                 show_example=True)
======================================================================================================================================================
Evaluation of share: other_service_operations
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 1339 (0.3% of all entries)
Quantity: 2944 (0.1% of the total quantity)
Revenue: -170398.9 (1.7% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month invoice_year_month \
197025 553885 POST POSTAGE 1 2019-05-17 15:41:00 18.00 12601 2019 5 2019-05
231083 557232 POST POSTAGE 2 2019-06-15 14:28:00 18.00 12463 2019 6 2019-06
16356 C537651 AMAZONFEE AMAZON FEE -1 2018-12-05 15:49:00 13541.33 0 2018 12 2018-12
527349 580705 POST POSTAGE 5 2019-12-03 16:28:00 1.00 12683 2019 12 2019-12
385284 570191 POST POSTAGE 1 2019-10-05 15:23:00 15.00 12778 2019 10 2019-10
invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
197025 20 2019-Week-20 2019-05-17 4 Friday 18.00
231083 24 2019-Week-24 2019-06-15 5 Saturday 36.00
16356 49 2018-Week-49 2018-12-05 2 Wednesday -13541.33
527349 49 2019-Week-49 2019-12-03 1 Tuesday 5.00
385284 40 2019-Week-40 2019-10-05 5 Saturday 15.00
======================================================================================================================================================
# checking descriptive statistics and summary of quantity and revenue for the other service operations
other_service_operations_grouped = other_service_operations.groupby('description')[['quantity','revenue']]

other_service_operations_grouped.describe().T
other_service_operations_grouped.sum()
metric | statistic | AMAZON FEE | Bank Charges | CRUK Commission | POSTAGE |
---|---|---|---|---|---|
quantity | count | 34.00 | 37.00 | 16.00 | 1252.00 |
 | mean | -0.88 | -0.35 | -1.00 | 2.40 |
 | std | 0.48 | 0.95 | 0.00 | 2.35 |
 | min | -1.00 | -1.00 | -1.00 | -4.00 |
 | 25% | -1.00 | -1.00 | -1.00 | 1.00 |
 | 50% | -1.00 | -1.00 | -1.00 | 2.00 |
 | 75% | -1.00 | 1.00 | -1.00 | 3.00 |
 | max | 1.00 | 1.00 | -1.00 | 21.00 |
revenue | count | 34.00 | 37.00 | 16.00 | 1252.00 |
 | mean | -6515.31 | -193.94 | -495.84 | 52.90 |
 | std | 5734.37 | 278.40 | 364.16 | 332.57 |
 | min | -17836.46 | -1050.15 | -1100.44 | -8142.75 |
 | 25% | -7322.69 | -366.27 | -668.98 | 18.00 |
 | 50% | -5876.79 | -82.73 | -471.77 | 36.00 |
 | 75% | -4737.99 | 15.00 | -284.25 | 72.00 |
 | max | 13541.33 | 15.00 | -1.60 | 8142.75 |
description | quantity | revenue |
---|---|---|
AMAZON FEE | -30 | -221520.50 |
Bank Charges | -13 | -7175.64 |
CRUK Commission | -16 | -7933.43 |
POSTAGE | 3003 | 66230.64 |
Observations
0.3% of entries, 0.1% of quantity, and 1.7% of revenue (a negative value in total) come from Other Service Operations (bank charges, marketplace fees, postage entries, and other commissions).
Most service operations include both positive and negative quantity and revenue values (and thus would be counted as sales if not cleaned out of the dataset). The largest negative revenue, about -221k in total, comes from the AMAZONFEE entries, and the largest positive revenue, about 66k in total, comes from the POSTAGE entries.
There is no obvious connection between service operations and specific items sold.
Previously, we observed that the “POST” stock code appeared in mutually exclusive entries, which can be explained by chargebacks of delivery-related expenses when products are returned. Given the insignificant share and impact of such operations, we won’t investigate this aspect further.
Decisions - Exclude the other service operations (stock codes AMAZONFEE, BANK CHARGES, CRUK and POST) from the main dataset as well, since they are not tied to specific products.
Implementation of Decisions
# exclude entries with service operations
operation = lambda df: df.query('stock_code not in @other_service_stock_codes')
df_ecom_no_returns_no_any_operations = data_reduction(df_ecom_no_returns_no_operations, operation)
Number of entries cleaned out from the "df_ecom_no_returns_no_operations": 1315 (0.2%)
# checking the result
share_evaluation(df_ecom_no_returns_no_any_operations, df_ecom_no_returns_no_operations, show_qty_rev=True)
======================================================================================================================================================
Evaluation of share: df_ecom_no_returns_no_any_operations
in df_ecom_no_returns_no_operations
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 530068 (99.8% of all entries)
Quantity: 5244977 (99.9% of the total quantity)
Revenue: 10134524.3 (101.5% of the total revenue)
======================================================================================================================================================
We previously came across stock codes that have more than one description, where, for example, one description looks like a normal product name while another seems to refer to some issue, containing words like “damages”, “found”, etc. Furthermore, there may be instances where the same description is associated with different stock codes.
Let’s examine such cases and their significance.
# creating a DataFrame of stock codes associated with numerous descriptions
stock_codes_multiple_descriptions = (
    df_ecom_filtered.groupby('stock_code')['description'].nunique()
    .sort_values(ascending=False)
    .reset_index()
    .query('description > 1'))
stock_codes_multiple_descriptions

# creating a set of stock codes associated with numerous descriptions
stock_codes_multiple_descriptions_set = set(stock_codes_multiple_descriptions['stock_code'])
| stock_code | description |
---|---|---|
0 | 20713 | 8 |
1 | 21830 | 6 |
2 | 23084 | 6 |
3 | 85172 | 5 |
4 | 23131 | 5 |
... | ... | ... |
637 | 23502 | 2 |
638 | 22176 | 2 |
639 | 22351 | 2 |
640 | 81950V | 2 |
641 | 23028 | 2 |
642 rows × 2 columns
# creating a DataFrame of descriptions associated with numerous stock codes
descriptions_multiple_stock_codes = (
    df_ecom_filtered.groupby('description')['stock_code'].nunique()
    .sort_values(ascending=False)
    .reset_index()
    .query('stock_code > 1'))
descriptions_multiple_stock_codes

# creating the full set of descriptions associated with numerous stock codes
descriptions_multiple_stock_codes_set = set(descriptions_multiple_stock_codes['description'])
| description | stock_code |
---|---|---|
0 | check | 146 |
1 | ? | 47 |
2 | damaged | 43 |
3 | damages | 43 |
4 | found | 25 |
... | ... | ... |
162 | SUNSET CHECK HAMMOCK | 2 |
163 | Dotcom sales | 2 |
164 | PINK HAWAIIAN PICNIC HAMPER FOR 2 | 2 |
165 | TEATIME FUNKY FLOWER BACKPACK FOR 2 | 2 |
166 | SCANDINAVIAN REDS RIBBONS | 2 |
167 rows × 2 columns
# checking the description associated with the most different stock codes and corresponding entries
first_description = descriptions_multiple_stock_codes['description'].iloc[0]
first_description_stock_codes_number = descriptions_multiple_stock_codes['stock_code'].iloc[0]

print(f'\n\033[1mDescription having the highest number of different stock codes ({first_description_stock_codes_number}):\033[0m \"{first_description}\"\n')
print(f'\033[1mRandom entries of \"{first_description}\" description:\033[0m')
df_ecom_filtered.query('description == @first_description').sample(3, random_state=7)
Description having the highest number of different stock codes (146): "check"
Random entries of "check" description:
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
468299 | 576414 | 17012F | check | 14 | 2019-11-13 11:21:00 | 0.00 | 0 | 2019 | 11 | 2019-11 | 46 | 2019-Week-46 | 2019-11-13 | 2 | Wednesday | 0.00 |
502118 | 578837 | 35833P | check | -24 | 2019-11-23 15:51:00 | 0.00 | 0 | 2019 | 11 | 2019-11 | 47 | 2019-Week-47 | 2019-11-23 | 5 | Saturday | -0.00 |
432117 | 573815 | 20902 | check | -3 | 2019-10-30 11:31:00 | 0.00 | 0 | 2019 | 10 | 2019-10 | 44 | 2019-Week-44 | 2019-10-30 | 2 | Wednesday | -0.00 |
# checking the share of data with stock codes associated with numerous descriptions
stock_codes_multiple_descriptions_entries = df_ecom_filtered.query('stock_code in @stock_codes_multiple_descriptions_set').sort_values(by='stock_code')

share_evaluation(stock_codes_multiple_descriptions_entries, df_ecom_filtered, show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code', 'nunique'): 'Stock Codes Coverage'},
                 show_pie_charts_notes=True,
                 show_example=True, example_type='head', example_limit=3)
======================================================================================================================================================
Evaluation of share: stock_codes_multiple_descriptions_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 109864 (20.8% of all entries)
Quantity: 1100000 (21.0% of the total quantity)
Revenue: 2532006.0 (25.5% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Note: "Quantity Share" shows which share of the total quantity in df_ecom_filtered falls into stock_codes_multiple_descriptions_entries.
Note: "Revenue Share" shows which share of the total revenue in df_ecom_filtered is generated in stock_codes_multiple_descriptions_entries.
Note: "Entries Share" shows which share of all entries in df_ecom_filtered occurs in stock_codes_multiple_descriptions_entries. Every entry is counted separately, even if they are associated with the same order.
Note: "Invoices Coverage" - if at least one entry of an invoice falls into stock_codes_multiple_descriptions_entries, it still counts as one full unique order in this chart.
Note: "Stock Codes Coverage" - if at least one entry of a stock code falls into stock_codes_multiple_descriptions_entries, it still counts as one full unique stock code in this chart.
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
487604 577773 10080 GROOVY CACTUS INFLATABLE 1 2019-11-19 15:57:00 0.39 16712 2019 11
488216 577801 10080 GROOVY CACTUS INFLATABLE 26 2019-11-19 17:04:00 0.39 17629 2019 11
460365 575908 10080 GROOVY CACTUS INFLATABLE 24 2019-11-09 15:54:00 0.39 13091 2019 11
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
487604 2019-11 47 2019-Week-47 2019-11-19 1 Tuesday 0.39
488216 2019-11 47 2019-Week-47 2019-11-19 1 Tuesday 10.14
460365 2019-11 45 2019-Week-45 2019-11-09 5 Saturday 9.36
======================================================================================================================================================
stock_codes_multiple_descriptions_entries.query('revenue>35000')
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222680 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2019-06-08 15:28:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 38970.00 |
# checking the share of data with descriptions associated with numerous stock codes
descriptions_multiple_stock_codes_entries = df_ecom_filtered.query('description in @descriptions_multiple_stock_codes_set').sort_values(by='description')

share_evaluation(descriptions_multiple_stock_codes_entries, df_ecom_filtered, show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code', 'nunique'): 'Stock Codes Coverage'},
                 show_pie_charts_notes=True,
                 show_example=True, example_type='head', example_limit=3)
======================================================================================================================================================
Evaluation of share: descriptions_multiple_stock_codes_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 23530 (4.4% of all entries)
Quantity: 129841 (2.5% of the total quantity)
Revenue: 480264.1 (4.8% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Note: "Quantity Share" shows which share of the total quantity in df_ecom_filtered falls into descriptions_multiple_stock_codes_entries.
Note: "Revenue Share" shows which share of the total revenue in df_ecom_filtered is generated in descriptions_multiple_stock_codes_entries.
Note: "Entries Share" shows which share of all entries in df_ecom_filtered occurs in descriptions_multiple_stock_codes_entries. Every entry is counted separately, even if they are associated with the same order.
Note: "Invoices Coverage" - if at least one entry of an invoice falls into descriptions_multiple_stock_codes_entries, it still counts as one full unique order in this chart.
Note: "Stock Codes Coverage" - if at least one entry of a stock code falls into descriptions_multiple_stock_codes_entries, it still counts as one full unique stock code in this chart.
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
374454 569374 85034A 3 GARDENIA MORRIS BOXED CANDLES 1 2019-10-01 16:14:00 8.29 0 2019 10
19524 537867 85034A 3 GARDENIA MORRIS BOXED CANDLES 4 2018-12-06 16:48:00 4.25 16717 2018 12
98724 544684 85034A 3 GARDENIA MORRIS BOXED CANDLES 1 2019-02-20 16:32:00 8.29 0 2019 2
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
374454 2019-10 40 2019-Week-40 2019-10-01 1 Tuesday 8.29
19524 2018-12 49 2018-Week-49 2018-12-06 3 Thursday 17.00
98724 2019-02 8 2019-Week-08 2019-02-20 2 Wednesday 8.29
======================================================================================================================================================
# checking stock codes that have numerous descriptions, associated descriptions and scope (quantity, revenue, number of invoices)
stock_codes_multiple_descriptions_summary = (
    stock_codes_multiple_descriptions_entries.groupby(['stock_code', 'description'])
    .agg({'quantity': 'sum', 'revenue': 'sum', 'invoice_no': 'count'})
    .reset_index()
    .sort_values(by='stock_code'))
stock_codes_multiple_descriptions_summary
| stock_code | description | quantity | revenue | invoice_no |
---|---|---|---|---|---|
0 | 10080 | GROOVY CACTUS INFLATABLE | 303 | 119.09 | 22 |
1 | 10080 | check | 22 | 0.00 | 1 |
2 | 10133 | COLOURING PENCILS BROWN TUBE | 2856 | 1539.60 | 196 |
3 | 10133 | damaged | -82 | 0.00 | 1 |
4 | 15058A | BLUE POLKADOT GARDEN PARASOL | 197 | 1647.04 | 92 |
... | ... | ... | ... | ... | ... |
1445 | 90195A | check | -45 | 0.00 | 1 |
1446 | 90210D | PURPLE ACRYLIC FACETED BANGLE | 107 | 132.05 | 8 |
1447 | 90210D | check | -28 | 0.00 | 1 |
1448 | gift_0001_20 | Dotcomgiftshop Gift Voucher £20.00 | 10 | 167.05 | 9 |
1449 | gift_0001_20 | to push order througha s stock was | 10 | 0.00 | 1 |
1450 rows × 5 columns
# checking the full set of descriptions associated with numerous stock codes
descriptions_multiple_stock_codes_set
{'3 GARDENIA MORRIS BOXED CANDLES',
'3 WHITE CHOC MORRIS BOXED CANDLES',
'3D DOG PICTURE PLAYING CARDS',
'3D SHEET OF CAT STICKERS',
'3D SHEET OF DOG STICKERS',
'4 ROSE PINK DINNER CANDLES',
'4 SKY BLUE DINNER CANDLES',
'75 GREEN FAIRY CAKE CASES',
'75 GREEN PETIT FOUR CASES',
'?',
'??',
'???missing',
'?missing',
'ANT WHITE WIRE HEART SPIRAL',
'Adjustment',
'BISCUITS SMALL BOWL LIGHT BLUE',
'BLACK CHUNKY BEAD BRACELET W STRAP',
'BLACK DROP EARRINGS W LONG BEADS',
'BLACK ENCHANTED FOREST PLACEMAT',
'BLACK SQUARE TABLE CLOCK',
'BLACK STITCHED WALL CLOCK',
'BLACK/BLUE POLKADOT UMBRELLA',
'BLUE 3 PIECE POLKADOT CUTLERY SET',
'BRIGHT BLUES RIBBONS ',
'CHARLIE + LOLA BISCUITS TINS',
'CHARLIE AND LOLA FIGURES TINS',
'CHARLIE AND LOLA TABLE TINS',
'CHARLIE LOLA BLUE HOT WATER BOTTLE ',
'CHARLIE+LOLA RED HOT WATER BOTTLE ',
'CHECK',
'CHILDRENS CUTLERY POLKADOT BLUE',
'CHILDRENS CUTLERY POLKADOT GREEN ',
'CHILDRENS CUTLERY POLKADOT PINK',
'CHILDRENS CUTLERY RETROSPOT RED ',
'CHOCOLATE 1 WICK MORRIS BOX CANDLE',
'CHOCOLATE 3 WICK MORRIS BOX CANDLE',
'CHOCOLATE BOX RIBBONS ',
'CINAMMON SET OF 9 T-LIGHTS',
'COLOURING PENCILS BROWN TUBE',
'COLUMBIAN CANDLE RECTANGLE',
'COLUMBIAN CANDLE ROUND',
'DOORMAT BLACK FLOCK ',
'Damaged',
'Dotcom sales',
'EAU DE NILE JEWELLED PHOTOFRAME',
'EDWARDIAN PARASOL BLACK',
'EDWARDIAN PARASOL NATURAL',
'EDWARDIAN PARASOL PINK',
'ENAMEL PINK TEA CONTAINER',
'ENGLISH ROSE HOT WATER BOTTLE',
'ENGLISH ROSE NOTEBOOK A7 SIZE',
'FAIRY CAKE DESIGN UMBRELLA',
'FAIRY CAKE NOTEBOOK A5 SIZE',
'FAIRY CAKES NOTEBOOK A7 SIZE',
'FEATHER PEN,COAL BLACK',
'FRENCH FLORAL CUSHION COVER ',
'FRENCH LATTICE CUSHION COVER ',
'FROSTED WHITE BASE ',
'Found',
'GARDENIA 1 WICK MORRIS BOXED CANDLE',
'GARDENIA 3 WICK MORRIS BOXED CANDLE',
'GREEN 3 PIECE POLKADOT CUTLERY SET',
'GREEN BITTY LIGHT CHAIN',
'HANGING HEART ZINC T-LIGHT HOLDER',
'ICON PLACEMAT POP ART ELVIS',
'IVORY ENCHANTED FOREST PLACEMAT',
'JUMBO BAG STRAWBERRY',
'LUSH GREENS RIBBONS',
'METAL SIGN,CUPCAKE SINGLE HOOK',
'ORANGE SCENTED SET/9 T-LIGHTS',
'PAPER LANTERN 9 POINT SNOW STAR',
'PINK 3 PIECE POLKADOT CUTLERY SET',
'PINK FAIRY CAKE CHILDRENS APRON',
'PINK FAIRY CAKE CUSHION COVER',
'PINK FLOCK GLASS CANDLEHOLDER',
'PINK FLOWERS RABBIT EASTER',
'PINK HAPPY BIRTHDAY BUNTING',
'PINK HAWAIIAN PICNIC HAMPER FOR 2',
'PINK STITCHED WALL CLOCK',
'PORCELAIN BUTTERFLY OIL BURNER',
'RED 3 PIECE RETROSPOT CUTLERY SET',
'RED ENCHANTED FOREST PLACEMAT',
'RED RETROSPOT UMBRELLA',
'RETRO MOD TRAY',
"RETRO PLASTIC 70'S TRAY",
'RETRO PLASTIC DAISY TRAY',
'RETRO PLASTIC POLKA TRAY',
'ROMANTIC PINKS RIBBONS ',
'ROSE 3 WICK MORRIS BOX CANDLE',
'ROSE SCENT CANDLE IN JEWELLED BOX',
'ROUND BLUE CLOCK WITH SUCKER',
'S/4 PINK FLOWER CANDLES IN BOWL',
'SCANDINAVIAN REDS RIBBONS',
'SCOTTIE DOGS BABY BIB',
'SCOTTIES CHILDRENS APRON',
'SET 4 VALENTINE DECOUPAGE HEART BOX',
'SET OF 16 VINTAGE BLACK CUTLERY',
'SET OF 16 VINTAGE RED CUTLERY',
'SET OF 16 VINTAGE ROSE CUTLERY',
'SET OF 16 VINTAGE SKY BLUE CUTLERY',
'SET OF 4 ENGLISH ROSE COASTERS',
'SET OF 4 ENGLISH ROSE PLACEMATS',
'SET OF 4 FAIRY CAKE PLACEMATS',
'SET OF 4 FAIRY CAKE PLACEMATS ',
'SET OF 4 GREEN CAROUSEL COASTERS',
'SET OF 4 POLKADOT COASTERS',
'SET OF 4 POLKADOT PLACEMATS ',
'SET/3 OCEAN SCENT CANDLE JEWEL BOX',
'SET/3 ROSE CANDLE IN JEWELLED BOX',
'SET/3 VANILLA SCENTED CANDLE IN BOX',
'SET/4 RED MINI ROSE CANDLE IN BOWL',
'SET/6 PURPLE BUTTERFLY T-LIGHTS',
'SET/6 TURQUOISE BUTTERFLY T-LIGHTS',
'SILVER RECORD COVER FRAME',
'SINGLE HEART ZINC T-LIGHT HOLDER',
'SMALL CHOCOLATES PINK BOWL',
'SMALL DOLLY MIX DESIGN ORANGE BOWL',
'SMALL LICORICE DES PINK BOWL',
'SMALL MARSHMALLOWS PINK BOWL',
'SQUARE CHERRY BLOSSOM CABINET',
'STORAGE TIN VINTAGE LEAF',
'SUNSET CHECK HAMMOCK',
'TEA TIME OVEN GLOVE',
'TEA TIME PARTY BUNTING',
'TEA TIME TABLE CLOTH',
'TEATIME FUNKY FLOWER BACKPACK FOR 2',
'TRADITIONAL CHRISTMAS RIBBONS',
'Unsaleable, destroyed.',
'VANILLA SCENT CANDLE JEWELLED BOX',
'VINYL RECORD FRAME SILVER',
'WHITE BAMBOO RIBS LAMPSHADE',
'WHITE BIRD GARDEN DESIGN MUG',
'WHITE HANGING HEART T-LIGHT HOLDER',
'WHITE SQUARE TABLE CLOCK',
'WHITE STITCHED WALL CLOCK',
'WOODEN FRAME ANTIQUE WHITE ',
'WOVEN BERRIES CUSHION COVER ',
'WOVEN BUBBLE GUM CUSHION COVER',
'WOVEN CANDY CUSHION COVER ',
'WOVEN ROSE GARDEN CUSHION COVER ',
'adjustment',
'check',
'counted',
'crushed',
'damaged',
'damages',
'damages wax',
'damages?',
'dotcom',
'found',
'had been put aside',
'incorrect stock entry.',
'mailout',
'missing',
'mixed up',
'returned',
'reverse 21/5/10 adjustment',
'rusty throw away',
'smashed',
'sold as 1',
'sold as set on dotcom',
'stock check',
'test',
'thrown away',
'wet damaged',
'wet pallet',
'wet/rusty'}
We see normal product descriptions as well as odd ones, for example related to packaging or inventory issues. We could apply regex filters or even use ML to clean out the unusual descriptions, but since the list is fairly short, manual filtering will be faster and more accurate.
Furthermore, some descriptions seem to describe the same product in essence, just written differently (e.g. “SET OF 4 FAIRY CAKE PLACEMATS” and “SET OF 4 FAIRY CAKE PLACEMATS ” - the latter with an extra space at the end). We will study such cases in a later step.
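To illustrate the idea, here is a minimal sketch (an illustration only, not the standardization step applied later in the project) that surfaces descriptions differing only by extra spaces or punctuation by comparing them after a simple normalization.
# a minimal sketch (illustration only, not the standardization applied later):
# spotting descriptions that differ only by extra spaces or punctuation
normalized = (df_ecom_filtered['description']
              .str.upper()
              .str.replace(r'[^A-Z0-9 ]', ' ', regex=True)  # replace punctuation with spaces
              .str.replace(r'\s+', ' ', regex=True)         # collapse repeated spaces
              .str.strip())

near_duplicates = (df_ecom_filtered.assign(normalized=normalized)
                   [['description', 'normalized']]
                   .drop_duplicates()
                   .groupby('normalized')['description'].nunique()
                   .loc[lambda s: s > 1]
                   .sort_values(ascending=False))
near_duplicates.head()
Groups with more than one raw spelling are candidates for merging into a single product name.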
# defining a set of unusual descriptions (associated with numerous stock codes)
unusual_descriptions = {
'?',
'??',
'???missing',
'?missing',
'Adjustment',
'CHECK',
'Damaged',
'Dotcom sales',
'Found',
'Unsaleable, destroyed.',
'adjustment',
'check',
'counted',
'crushed',
'damaged',
'damages',
'damages wax',
'damages?',
'dotcom',
'found',
'had been put aside',
'incorrect stock entry.',
'mailout',
'missing',
'mixed up',
'returned',
'reverse 21/5/10 adjustment',
'rusty throw away',
'smashed',
'sold as 1',
'sold as set on dotcom',
'stock check',
'test',
'thrown away',
'wet damaged',
'wet pallet',
'wet/rusty'}
Let’s also check the unusual descriptions discovered above when grouping by stock codes, and then filter for descriptions that contain lowercase letters.
# checking descriptions related to stock codes that have more than one description
multiple_descriptions_count = df_ecom_filtered.query('stock_code in @stock_codes_multiple_descriptions_set')['description'].value_counts().reset_index()
multiple_descriptions_count.columns = ['description', 'count']
multiple_descriptions_count
| description | count |
---|---|---|
0 | WHITE HANGING HEART T-LIGHT HOLDER | 2278 |
1 | REGENCY CAKESTAND 3 TIER | 2143 |
2 | LUNCH BAG RED RETROSPOT | 1612 |
3 | ASSORTED COLOUR BIRD ORNAMENT | 1483 |
4 | SPOTTY BUNTING | 1166 |
... | ... | ... |
1026 | ?display? | 1 |
1027 | crushed ctn | 1 |
1028 | MINT DINER CLOCK | 1 |
1029 | samples/damages | 1 |
1030 | SET/5 RED SPOTTY LID GLASS BOWLS | 1 |
1031 rows × 2 columns
We can see that atypical descriptions remain; unlike normal product-related descriptions, they are written in lowercase only. Let’s check the other descriptions that contain lowercase letters.
multiple_descriptions_has_lowercase = sorted(
    list(
        multiple_descriptions_count[multiple_descriptions_count['description'].str.contains('[a-z]')]
        ['description'].unique()))
multiple_descriptions_has_lowercase
['20713 wrongly marked',
'3 TRADITIONAl BISCUIT CUTTERS SET',
'? sold as sets?',
'?? missing',
'????damages????',
'????missing',
'???lost',
'???missing',
'?display?',
'?lost',
'?missing',
'?sold as sets?',
'Adjustment',
'Breakages',
'Crushed',
'Dagamed',
'Damaged',
'Damages',
'Damages/samples',
'Display',
'Dotcom sales',
'Dotcom set',
"Dotcom sold in 6's",
'Dotcomgiftshop Gift Voucher £20.00',
'Found',
'Found in w/hse',
'Given away',
'Had been put aside.',
'Incorrect stock entry.',
'John Lewis',
'Lighthouse Trading zero invc incorr',
'Marked as 23343',
'Missing',
'Not rcvd in 10/11/2010 delivery',
'OOPS ! adjustment',
'POLYESTER FILLER PAD 30CMx30CM',
'POLYESTER FILLER PAD 40x40cm',
'POLYESTER FILLER PAD 45x45cm',
'Printing smudges/thrown away',
'Sale error',
'Show Samples',
'Sold as 1 on dotcom',
'THE KING GIFT BAG 25x24x12cm',
'Thrown away.',
'Unsaleable, destroyed.',
'Water damaged',
'Wet pallet-thrown away',
'Wrongly mrked had 85123a in box',
'add stock to allocate online orders',
'adjust',
'adjustment',
'alan hodge cant mamage this section',
'allocate stock for dotcom orders ta',
'barcode problem',
'broken',
'came coded as 20713',
"can't find",
'check',
'check?',
'code mix up? 84930',
'counted',
'cracked',
'crushed',
'crushed boxes',
'crushed ctn',
'damaged',
'damaged stock',
'damages',
'damages wax',
'damages/credits from ASOS.',
'damages/display',
'damages/dotcom?',
'damages/showroom etc',
'damages?',
'did a credit and did not tick ret',
'dotcom',
'dotcom adjust',
'dotcom sales',
'dotcom sold sets',
'dotcomstock',
'faulty',
'for online retail orders',
'found',
'found box',
'found some more on shelf',
'had been put aside',
'historic computer difference?....se',
'incorrect stock entry.',
'incorrectly credited C550456 see 47',
'incorrectly made-thrown away.',
'incorrectly put back into stock',
'label mix up',
'lost',
'lost in space',
'lost??',
'mailout',
'mailout ',
'michel oops',
'missing',
'missing?',
'mix up with c',
'mixed up',
'mouldy',
'mouldy, thrown away.',
'mouldy, unsaleable.',
'mystery! Only ever imported 1800',
'on cargo order',
'printing smudges/thrown away',
'rcvd be air temp fix for dotcom sit',
'returned',
'reverse 21/5/10 adjustment',
'rusty throw away',
'rusty thrown away',
'samples',
'samples/damages',
'showroom',
'smashed',
'sold as 1',
'sold as 22467',
'sold as set by dotcom',
'sold as set on dotcom',
'sold as set/6 by dotcom',
'sold in set?',
'sold with wrong barcode',
'stock check',
'stock creditted wrongly',
'taig adjust',
'taig adjust no stock',
'temp adjustment',
'test',
'thrown away',
'to push order througha s stock was ',
'water damage',
'water damaged',
'website fixed',
'wet',
'wet boxes',
'wet damaged',
'wet pallet',
'wet rusty',
'wet/rusty',
'wet?',
'wrong barcode',
'wrong barcode (22467)',
'wrong code',
'wrong code?',
'wrongly coded 20713',
'wrongly coded 23343',
'wrongly coded-23343',
'wrongly marked',
'wrongly marked 23343',
'wrongly marked carton 22804',
'wrongly marked. 23343 in box',
'wrongly sold (22719) barcode',
'wrongly sold as sets',
'wrongly sold sets']
“3 TRADITIONAl BISCUIT CUTTERS SET” appears in the list only because of the lowercase “l” in “TRADITIONAl” (it should read “TRADITIONAL”). Since it is an ordinary product, we will drop it from the list of unusual descriptions. The same goes for products whose descriptions contain measures in “cm”, which the [a-z] pattern also catches; we will exclude them from the list as well.
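As an aside, a stricter programmatic check could avoid such false positives by flagging only descriptions whose letters are mostly lowercase; a minimal sketch follows (the 0.5 threshold is an arbitrary assumption), although manual review remains the approach used here.
# a minimal sketch (the 0.5 threshold is an arbitrary assumption): keep only descriptions
# whose letters are mostly lowercase, so product names with a single stray lowercase letter
# ("TRADITIONAl") or unit suffixes ("40x40cm") are not flagged
def lowercase_share(text):
    letters = [ch for ch in text if ch.isalpha()]
    return sum(ch.islower() for ch in letters) / len(letters) if letters else 0.0

mostly_lowercase = [d for d in multiple_descriptions_has_lowercase if lowercase_share(d) > 0.5]
mostly_lowercase[:10]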
unusual_descriptions2 = {
'20713 wrongly marked',
'? sold as sets?',
'?? missing',
'????damages????',
'????missing',
'???lost',
'?display?',
'?lost',
'?sold as sets?',
'Breakages',
'Crushed',
'Dagamed',
'Damages',
'Damages/samples',
'Display',
'Dotcom',
'Dotcom set',
"Dotcom sold in 6's",
'Found in w/hse',
'Given away',
'Had been put aside.',
'Incorrect stock entry.',
'John Lewis',
'Lighthouse Trading zero invc incorr',
'Marked as 23343',
'Missing',
'Not rcvd in 10/11/2010 delivery',
'OOPS ! adjustment',
'Printing smudges/thrown away',
'Sale error',
'Show Samples',
'Sold as 1 on dotcom',
'Thrown away.',
'Water damaged',
'Wet pallet-thrown away',
'Wrongly mrked had 85123a in box',
'add stock to allocate online orders',
'adjust',
'alan hodge cant mamage this section',
'allocate stock for dotcom orders ta',
'barcode problem',
'broken',
'came coded as 20713',
"can't find",
'check?',
'code mix up? 84930',
'cracked',
'crushed boxes',
'crushed ctn',
'damaged stock',
'damages/credits from ASOS.',
'damages/display',
'damages/dotcom?',
'damages/showroom etc',
'did a credit and did not tick ret',
'dotcom adjust',
'dotcom sales',
'dotcom sold sets',
'dotcomstock',
'faulty',
'for online retail orders',
'found box',
'found some more on shelf',
'historic computer difference?....se',
'incorrectly credited C550456 see 47',
'incorrectly made-thrown away.',
'incorrectly put back into stock',
'label mix up',
'lost',
'lost in space',
'lost??',
'mailout ',
'michel oops',
'missing?',
'mix up with c',
'mouldy',
'mouldy, thrown away.',
'mouldy, unsaleable.',
'mystery! Only ever imported 1800',
'on cargo order',
'printing smudges/thrown away',
'rcvd be air temp fix for dotcom sit',
're dotcom quick fix.',
'reverse previous adjustment',
'rusty thrown away',
'samples',
'samples/damages',
'showroom',
'sold as 22467',
'sold as set by dotcom',
'sold as set/6 by dotcom',
'sold in set?',
'sold with wrong barcode',
'stock creditted wrongly',
'taig adjust',
'taig adjust no stock',
'temp adjustment',
'to push order througha s stock was ',
'water damage',
'water damaged',
'website fixed',
'wet',
'wet boxes',
'wet rusty',
'wet?',
'wrong barcode',
'wrong barcode (22467)',
'wrong code',
'wrong code?',
'wrongly coded 20713',
'wrongly coded 23343',
'wrongly coded-23343',
'wrongly marked',
'wrongly marked 23343',
'wrongly marked carton 22804',
'wrongly marked. 23343 in box',
'wrongly sold (22719) barcode',
'wrongly sold as sets',
'wrongly sold sets'}
# filtering elements that are in either of the sets but not in their intersection
unusual_descriptions_overall = unusual_descriptions.symmetric_difference(unusual_descriptions2)

# checking the result
len(unusual_descriptions)
len(unusual_descriptions2)
len(unusual_descriptions_overall)
# sorted(unusual_descriptions_overall)
37
119
156
# defining unusual entries
unusual_entries = df_ecom_filtered.query('description in @unusual_descriptions_overall').sort_values(by='quantity')
unusual_entries
| invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
225530 | 556691 | 23005 | printing smudges/thrown away | -9600 | 2019-06-12 10:37:00 | 0.00 | 0 | 2019 | 6 | 2019-06 | 24 | 2019-Week-24 | 2019-06-12 | 2 | Wednesday | -0.00 |
225529 | 556690 | 23005 | printing smudges/thrown away | -9600 | 2019-06-12 10:37:00 | 0.00 | 0 | 2019 | 6 | 2019-06 | 24 | 2019-Week-24 | 2019-06-12 | 2 | Wednesday | -0.00 |
225528 | 556687 | 23003 | Printing smudges/thrown away | -9058 | 2019-06-12 10:36:00 | 0.00 | 0 | 2019 | 6 | 2019-06 | 24 | 2019-Week-24 | 2019-06-12 | 2 | Wednesday | -0.00 |
431381 | 573596 | 79323W | Unsaleable, destroyed. | -4830 | 2019-10-29 15:17:00 | 0.00 | 0 | 2019 | 10 | 2019-10 | 44 | 2019-Week-44 | 2019-10-29 | 1 | Tuesday | -0.00 |
263884 | 560039 | 20713 | wrongly marked. 23343 in box | -3100 | 2019-07-12 14:27:00 | 0.00 | 0 | 2019 | 7 | 2019-07 | 28 | 2019-Week-28 | 2019-07-12 | 4 | Friday | -0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
203751 | 554550 | 47566B | incorrectly credited C550456 see 47 | 1300 | 2019-05-23 09:57:00 | 0.00 | 0 | 2019 | 5 | 2019-05 | 21 | 2019-Week-21 | 2019-05-23 | 3 | Thursday | 0.00 |
160541 | 550460 | 47556B | did a credit and did not tick ret | 1300 | 2019-04-16 13:18:00 | 0.00 | 0 | 2019 | 4 | 2019-04 | 16 | 2019-Week-16 | 2019-04-16 | 1 | Tuesday | 0.00 |
115807 | 546139 | 84988 | ? | 3000 | 2019-03-07 16:35:00 | 0.00 | 0 | 2019 | 3 | 2019-03 | 10 | 2019-Week-10 | 2019-03-07 | 3 | Thursday | 0.00 |
263885 | 560040 | 23343 | came coded as 20713 | 3100 | 2019-07-12 14:28:00 | 0.00 | 0 | 2019 | 7 | 2019-07 | 28 | 2019-Week-28 | 2019-07-12 | 4 | Friday | 0.00 |
220843 | 556231 | 85123A | ? | 4000 | 2019-06-07 15:04:00 | 0.00 | 0 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-07 | 4 | Friday | 0.00 |
591 rows × 16 columns
# checking the share of unusual entries
share_evaluation(unusual_entries, df_ecom, show_qty_rev=True, show_boxplots=True)
======================================================================================================================================================
Evaluation of share: unusual_entries
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 591 (0.1% of all entries)
Quantity: -121639 (2.4% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
# checking the 10 most popular unusual descriptions
most_popular_unusual_entries = (unusual_entries.groupby('description')
                                .agg({'quantity':'sum','revenue':'sum', 'invoice_no':'count'})
                                .reset_index().sort_values(by='invoice_no', ascending=False))
ten_most_popular_unusual_entries = most_popular_unusual_entries.head(10)

share_evaluation(ten_most_popular_unusual_entries, df_ecom, boxplots_parameter='description', show_qty_rev=True, show_boxplots=True)
======================================================================================================================================================
Evaluation of share: ten_most_popular_unusual_entries
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 10 (0.0% of all entries)
Quantity: -46758 (0.9% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
# defining unusual entries with positive and negative quantity
negative_qty_unusual_entries = unusual_entries.query('quantity < 0')
positive_qty_unusual_entries = unusual_entries.query('quantity >= 0')
# checking the share of unusual entries with positive quantity
share_evaluation(positive_qty_unusual_entries, df_ecom, show_qty_rev=True)
======================================================================================================================================================
Evaluation of share: positive_qty_unusual_entries
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 140 (0.0% of all entries)
Quantity: 22779 (0.4% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
======================================================================================================================================================
# checking the share of unusual entries with negative quantity
negative_qty_unusual_entries = unusual_entries.query('quantity < 0')
share_evaluation(negative_qty_unusual_entries, df_ecom, show_qty_rev=True)
======================================================================================================================================================
Evaluation of share: negative_qty_unusual_entries
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 451 (0.1% of all entries)
Quantity: -144418 (2.8% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
======================================================================================================================================================
# checking 10 most notable unusual descriptions with negative quantities
most_notable_negative_qty_unusual_entries = (negative_qty_unusual_entries.groupby('description')
                                             .agg({'quantity':'sum','revenue':'sum', 'invoice_no':'count'})
                                             .reset_index().sort_values(by='quantity'))

ten_most_notable_negative_qty_unusual_entries = most_notable_negative_qty_unusual_entries.head(10)

share_evaluation(ten_most_notable_negative_qty_unusual_entries, df_ecom, boxplots_parameter='description', show_qty_rev=True, show_boxplots=True)
======================================================================================================================================================
Evaluation of share: ten_most_notable_negative_qty_unusual_entries
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 10 (0.0% of all entries)
Quantity: -90053 (1.7% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
Observations
We see that unusual descriptions serve as one more tool for adjusting order placement or recording physical issues (such as damaged or missing units).
These entries account for just 0.1% of all entries and 2.4% of the total quantity (most of them carry negative quantities).
Their distinctive feature is that they correct quantities without affecting revenue (they are registered with zero prices, as in the examples above). We can therefore conclude that the overall revenue data is not 100% trustworthy.
The ten most notable non-product operations (unusual descriptions) account for 1.7% of the total quantity loss, but no revenue loss, as noted above.
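The zero-revenue claim above can be double-checked with a quick sketch (assuming the unusual_entries frame defined earlier):
# a quick check of the claim above: unusual entries are expected to carry zero unit price,
# so they adjust quantities without touching revenue
print((unusual_entries['unit_price'] == 0).mean())  # share of zero-priced unusual entries
print((unusual_entries['revenue'] == 0).all())      # expected to be True given the shares reported above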
Decisions - Remove unusual entries. It is hard to determine what actually stands behind descriptions such as “damaged” or “incorrect stock entry”, and these entries appear to be of low value for product analysis.
Implementation of Decisions
# getting rid of unusual entries
operation = lambda df: df.query('description not in @unusual_descriptions_overall')
df_ecom_filtered = data_reduction(df_ecom_filtered, operation)
Number of entries cleaned out from the "df_ecom_filtered": 591 (0.1%)
Let’s check stock codes with multiple descriptions and vice versa after filtering out atypical descriptions.
# checking descriptions with multiple stock codes after filtering
descriptions_multiple_stock_codes_filtered = (
    df_ecom_filtered.groupby('description')['stock_code'].nunique()
    .sort_values(ascending=False)
    .reset_index()
    .query('stock_code > 1'))
descriptions_multiple_stock_codes_filtered
| description | stock_code |
---|---|---|
0 | METAL SIGN,CUPCAKE SINGLE HOOK | 6 |
1 | GREEN 3 PIECE POLKADOT CUTLERY SET | 2 |
2 | BLACK ENCHANTED FOREST PLACEMAT | 2 |
3 | JUMBO BAG STRAWBERRY | 2 |
4 | SET OF 16 VINTAGE BLACK CUTLERY | 2 |
... | ... | ... |
125 | 3 WHITE CHOC MORRIS BOXED CANDLES | 2 |
126 | EDWARDIAN PARASOL NATURAL | 2 |
127 | SET/3 OCEAN SCENT CANDLE JEWEL BOX | 2 |
128 | 3D SHEET OF DOG STICKERS | 2 |
129 | FRENCH FLORAL CUSHION COVER | 2 |
130 rows × 2 columns
# checking stock codes with multiple descriptions after filtering
stock_codes_multiple_descriptions_filtered = (
    df_ecom_filtered.groupby('stock_code')['description'].nunique()
    .sort_values(ascending=False)
    .reset_index()
    .query('description > 1'))
stock_codes_multiple_descriptions_filtered
| stock_code | description |
---|---|---|
0 | 23196 | 4 |
1 | 23236 | 4 |
2 | 23366 | 3 |
3 | 23209 | 3 |
4 | 17107D | 3 |
... | ... | ... |
224 | 35817P | 2 |
225 | 23028 | 2 |
226 | 23086 | 2 |
227 | 23253 | 2 |
228 | 23075 | 2 |
229 rows × 2 columns
# checking the result of filtering
original_desc_count = len(descriptions_multiple_stock_codes)
filtered_desc_count = len(descriptions_multiple_stock_codes_filtered)
desc_percent = (filtered_desc_count / original_desc_count) * 100

original_stock_count = len(stock_codes_multiple_descriptions)
filtered_stock_count = len(stock_codes_multiple_descriptions_filtered)
stock_percent = (filtered_stock_count / original_stock_count) * 100
print("="*100)
print(f'\033[1mDescriptions with multiple stock codes after filtering:\033[0m {filtered_desc_count:,} ({original_desc_count:,} originally, {desc_percent:.1f}% remaining)')
print(f'\033[1mStock codes with multiple descriptions after filtering:\033[0m {filtered_stock_count:,} ({original_stock_count:,} originally, {stock_percent:.1f}% remaining)')
print("="*100)
====================================================================================================
Descriptions with multiple stock codes after filtering: 130 (167 originally, 77.8% remaining)
Stock codes with multiple descriptions after filtering: 229 (642 originally, 35.7% remaining)
====================================================================================================
# checking stock codes of descriptions with multiple stock codes
descriptions_multiple_stock_codes_set_filtered = set(descriptions_multiple_stock_codes_filtered['description'])
descriptions_multiple_stock_codes_summary_filtered = (
    df_ecom_filtered.query('description in @descriptions_multiple_stock_codes_set_filtered')
    .groupby('description')
    ['stock_code'].value_counts()
    .reset_index(name='count'))
descriptions_multiple_stock_codes_summary_filtered.head(6)
| description | stock_code | count |
---|---|---|---|
0 | 3 GARDENIA MORRIS BOXED CANDLES | 85034A | 83 |
1 | 3 GARDENIA MORRIS BOXED CANDLES | 85034a | 3 |
2 | 3 WHITE CHOC MORRIS BOXED CANDLES | 85034B | 122 |
3 | 3 WHITE CHOC MORRIS BOXED CANDLES | 85034b | 1 |
4 | 3D DOG PICTURE PLAYING CARDS | 84558A | 82 |
5 | 3D DOG PICTURE PLAYING CARDS | 84558a | 5 |
# checking descriptions of stock codes with multiple descriptions
stock_codes_multiple_descriptions_set_filtered = set(stock_codes_multiple_descriptions_filtered['stock_code'])
stock_codes_multiple_descriptions_summary_filtered = (
    df_ecom_filtered.query('stock_code in @stock_codes_multiple_descriptions_set_filtered')
    .groupby('stock_code')
    ['description'].value_counts()
    .reset_index(name='count'))
stock_codes_multiple_descriptions_summary_filtered.head(6)
| stock_code | description | count |
---|---|---|---|
0 | 16156L | WRAP CAROUSEL | 14 |
1 | 16156L | WRAP, CAROUSEL | 4 |
2 | 17107D | FLOWER FAIRY,5 SUMMER B'DRAW LINERS | 25 |
3 | 17107D | FLOWER FAIRY 5 DRAWER LINERS | 20 |
4 | 17107D | FLOWER FAIRY 5 SUMMER DRAW LINERS | 1 |
5 | 20622 | VIPPASSPORT COVER | 34 |
Let’s check a share of total of remaining entries of stock codes with multiple descriptions.
stock_codes_multiple_descriptions_filtered_set = set(stock_codes_multiple_descriptions_summary_filtered['stock_code'])
stock_codes_multiple_descriptions_filtered = df_ecom_filtered.query('stock_code in @stock_codes_multiple_descriptions_set_filtered')

share_evaluation(stock_codes_multiple_descriptions_filtered, df_ecom_filtered, show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code', 'nunique'): 'Stock Codes Coverage'},
                 show_boxplots=True)
======================================================================================================================================================
Evaluation of share: stock_codes_multiple_descriptions_filtered
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 50044 (9.5% of all entries)
Quantity: 562865 (10.5% of the total quantity)
Revenue: 1199770.2 (12.1% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Notes on the pie charts:
- Quantity Share: which part of the total quantity of df_ecom_filtered falls into stock_codes_multiple_descriptions_filtered.
- Revenue Share: which part of the total revenue of df_ecom_filtered is generated in stock_codes_multiple_descriptions_filtered.
- Entries Share: which part of all entries of df_ecom_filtered occurs in stock_codes_multiple_descriptions_filtered. Every entry is counted separately, even if they are associated with the same order.
- Invoices Coverage: if even one entry of an invoice falls into stock_codes_multiple_descriptions_filtered, it still counts as one full unique order in this chart.
- Stock Codes Coverage: if even one entry of a stock code falls into stock_codes_multiple_descriptions_filtered, it still counts as one full unique stock code in this chart.
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
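For reference, the shares reported above can be reproduced directly with pandas. The sketch below only illustrates how these metrics are defined; it is not the share_evaluation helper itself (which also draws the charts), and it assumes the column names used throughout the project.

# minimal sketch of the share metrics, not the share_evaluation implementation
subset, total = stock_codes_multiple_descriptions_filtered, df_ecom_filtered

entries_share = len(subset) / len(total) * 100                                     # every entry counted separately
quantity_share = subset['quantity'].sum() / total['quantity'].sum() * 100
revenue_share = subset['revenue'].sum() / total['revenue'].sum() * 100
invoices_coverage = subset['invoice_no'].nunique() / total['invoice_no'].nunique() * 100   # one matching entry = one full invoice
stock_codes_coverage = subset['stock_code'].nunique() / total['stock_code'].nunique() * 100  # one matching entry = one full stock code

print(f'Entries: {entries_share:.1f}%, Quantity: {quantity_share:.1f}%, Revenue: {revenue_share:.1f}%, '
      f'Invoices: {invoices_coverage:.1f}%, Stock codes: {stock_codes_coverage:.1f}%')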
Observations
Decisions
# identifying the most frequent description for each stock code
most_frequent_descriptions = (
    stock_codes_multiple_descriptions_summary_filtered
    .sort_values(by=['stock_code', 'count'], ascending=[True, False])
    .drop_duplicates(subset=['stock_code'])  # keeping only the first entry per stock code, i.e. its most popular description
    .set_index('stock_code')['description'])

most_frequent_descriptions.head()
stock_code
16156L WRAP CAROUSEL
17107D FLOWER FAIRY,5 SUMMER B'DRAW LINERS
20622 VIPPASSPORT COVER
20681 PINK POLKADOT CHILDRENS UMBRELLA
20725 LUNCH BAG RED RETROSPOT
Name: description, dtype: object
# creating a column of most frequent (standard) descriptions
stock_codes_multiple_descriptions_summary_filtered['standardized_description'] = stock_codes_multiple_descriptions_summary_filtered['stock_code'].map(most_frequent_descriptions)
stock_codes_multiple_descriptions_summary_filtered.head(5)

# creating a list of most frequent (standard) descriptions
most_frequent_descriptions_list = list(stock_codes_multiple_descriptions_summary_filtered['standardized_description'].unique())
most_frequent_descriptions_list[:5]
stock_code | description | count | standardized_description | |
---|---|---|---|---|
0 | 16156L | WRAP CAROUSEL | 14 | WRAP CAROUSEL |
1 | 16156L | WRAP, CAROUSEL | 4 | WRAP CAROUSEL |
2 | 17107D | FLOWER FAIRY,5 SUMMER B'DRAW LINERS | 25 | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
3 | 17107D | FLOWER FAIRY 5 DRAWER LINERS | 20 | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
4 | 17107D | FLOWER FAIRY 5 SUMMER DRAW LINERS | 1 | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
['WRAP CAROUSEL',
"FLOWER FAIRY,5 SUMMER B'DRAW LINERS",
'VIPPASSPORT COVER ',
'PINK POLKADOT CHILDRENS UMBRELLA',
'LUNCH BAG RED RETROSPOT']
# checking the result - initial and standardized (most popular) descriptions along with their corresponding stock codes
stock_codes_multiple_descriptions_summary_filtered
stock_code | description | count | standardized_description | |
---|---|---|---|---|
0 | 16156L | WRAP CAROUSEL | 14 | WRAP CAROUSEL |
1 | 16156L | WRAP, CAROUSEL | 4 | WRAP CAROUSEL |
2 | 17107D | FLOWER FAIRY,5 SUMMER B'DRAW LINERS | 25 | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
3 | 17107D | FLOWER FAIRY 5 DRAWER LINERS | 20 | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
4 | 17107D | FLOWER FAIRY 5 SUMMER DRAW LINERS | 1 | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
... | ... | ... | ... | ... |
472 | 90014A | SILVER M.O.P. ORBIT NECKLACE | 6 | SILVER/MOP ORBIT NECKLACE |
473 | 90014B | GOLD M PEARL ORBIT NECKLACE | 13 | GOLD M PEARL ORBIT NECKLACE |
474 | 90014B | GOLD M.O.P. ORBIT NECKLACE | 2 | GOLD M PEARL ORBIT NECKLACE |
475 | 90014C | SILVER AND BLACK ORBIT NECKLACE | 2 | SILVER AND BLACK ORBIT NECKLACE |
476 | 90014C | SILVER/BLACK ORBIT NECKLACE | 2 | SILVER AND BLACK ORBIT NECKLACE |
477 rows × 4 columns
Observations
Decisions
Note: By checking only the odd-looking descriptions, we may slightly reduce the accuracy of the corrections, but on the other hand we dramatically reduce the effort required for the further study, which currently looks like a reasonable trade-off.
# getting the list of stop words
stop_words = set(stopwords.words('english'))
descriptions = most_frequent_descriptions_list

mistakes = set()
for description in descriptions:
    for word in description.split():
        word_cleaned = word.strip("',. ").lower()  # cleaning out punctuation and spaces from the beginning and end of a word, if any
        if (word_cleaned not in stop_words and not wn.synsets(word_cleaned)):  # skipping stop words and checking the WordNet lexical database
            mistakes.add(word_cleaned)

print('\033[1mPossible mistakes in descriptions:\033[0m')
mistakes
Possible mistakes in descriptions:
{'&',
'+',
"50's",
"70's",
'ahoy',
'amelie',
'antoinette',
"b'draw",
"b'fly",
'botanique',
'c/cover',
'cakestand',
'candleholder',
"children's",
'childrens',
'crawlies',
'd.o.f',
'doiley',
'fairy,5',
'feltcraft',
'jardin',
'jean-paul',
'knick',
'marie',
'nicole',
'pannetone',
'polkadot',
'retrospot',
's/3',
's/4',
'set/5',
'set/6',
'silver/mop',
'smokey',
'snowflake,pink',
'spaceboy',
'squarecushion',
'suki',
't-light',
't-lights',
'vippassport',
'w/sucker'}
# filtering rows where `standardized_description` (lowercase) contains any of the mistakes
filter_mask = (stock_codes_multiple_descriptions_summary_filtered['standardized_description'].str.lower()
               .apply(lambda description: any(mistake in description for mistake in mistakes)))

# applying the filter and getting the DataFrame of descriptions containing possible mistakes
exceptions_data = stock_codes_multiple_descriptions_summary_filtered[filter_mask].copy()

# adding a new column `mistake` that contains the possible mistake(s) found in the `standardized_description` column
exceptions_data['mistake'] = (exceptions_data['standardized_description'].str.lower()
                              .apply(lambda description: ', '.join([mistake for mistake in mistakes if mistake in description])))  # joining mistakes as a string

# displaying the filtered result
pd.set_option('display.max_rows', None)  # displaying all rows

exceptions_data_summary = (
    exceptions_data.groupby(['mistake', 'stock_code', 'standardized_description', 'description'])
    .agg({'count': 'sum'})
    # .reset_index()
    .sort_values(by=['standardized_description', 'count'], ascending=[False, False]))

exceptions_data_summary
len(exceptions_data_summary)
pd.reset_option('display.max_rows')  # resetting the max rows display option
count | ||||
---|---|---|---|---|
mistake | stock_code | standardized_description | description | |
t-light | 23145 | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE | 170 |
ZINC T-LIGHT HOLDER STARS LARGE | 2 | |||
23086 | ZINC STAR T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER | 46 | |
ZINC STAR T-LIGHT HOLDER | 1 | |||
doiley | 23231 | WRAP DOILEY DESIGN | WRAP DOILEY DESIGN | 164 |
WRAP VINTAGE DOILY | 94 | |||
WRAP VINTAGE DOILEY | 2 | |||
s/3 | 82486 | WOOD S/3 CABINET ANT WHITE FINISH | WOOD S/3 CABINET ANT WHITE FINISH | 414 |
3 DRAWER ANTIQUE WHITE WOOD CABINET | 205 | |||
t-light | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | WHITE HANGING HEART T-LIGHT HOLDER | 2278 |
CREAM HANGING HEART T-LIGHT HOLDER | 9 | |||
ahoy | 23523 | WALL ART TREASURE AHOY | WALL ART TREASURE AHOY | 25 |
TREASURE AHOY WALL ART | 16 | |||
spaceboy | 23528 | WALL ART SPACEBOY | WALL ART SPACEBOY | 37 |
SPACEBOY WALL ART | 13 | |||
& | 23524 | WALL ART HORSE & PONY | WALL ART HORSE & PONY | 46 |
HORSE & PONY WALL ART | 17 | |||
70's | 23542 | WALL ART 70'S ALPHABET | WALL ART 70'S ALPHABET | 76 |
70'S ALPHABET WALL ART | 15 | |||
vippassport | 20622 | VIPPASSPORT COVER | VIPPASSPORT COVER | 34 |
VIP PASSPORT COVER | 17 | |||
cakestand | 22776 | SWEETHEART CAKESTAND 3 TIER | SWEETHEART CAKESTAND 3 TIER | 398 |
SWEETHEART 3 TIER CAKE STAND | 165 | |||
CAKESTAND, 3 TIER, LOVEHEART | 1 | |||
squarecushion | 22785 | SQUARECUSHION COVER PINK UNION JACK | SQUARECUSHION COVER PINK UNION JACK | 42 |
SQUARECUSHION COVER PINK UNION FLAG | 32 | |||
spaceboy | 23389 | SPACEBOY MINI BACKPACK | SPACEBOY MINI BACKPACK | 236 |
SPACEBOY MINI RUCKSACK | 4 | |||
childrens, spaceboy | 23292 | SPACEBOY CHILDRENS CUP | SPACEBOY CHILDRENS CUP | 220 |
SPACE BOY CHILDRENS CUP | 6 | |||
smokey, d.o.f | 79051A | SMOKEY GREY COLOUR D.O.F. GLASS | SMOKEY GREY COLOUR D.O.F. GLASS | 27 |
SMOKEY GREY COLOUR GLASS | 15 | |||
silver/mop | 90014A | SILVER/MOP ORBIT NECKLACE | SILVER/MOP ORBIT NECKLACE | 15 |
SILVER M.O.P. ORBIT NECKLACE | 6 | |||
set/6 | 21090 | SET/6 COLLAGE PAPER PLATES | SET/6 COLLAGE PAPER PLATES | 218 |
WET/MOULDY | 1 | |||
set/5, retrospot | 20914 | SET/5 RED RETROSPOT LID GLASS BOWLS | SET/5 RED RETROSPOT LID GLASS BOWLS | 920 |
SET/5 RED SPOTTY LID GLASS BOWLS | 1 | |||
knick | 23237 | SET OF 4 KNICK KNACK TINS LEAF | SET OF 4 KNICK KNACK TINS LEAF | 127 |
SET OF 4 KNICK KNACK TINS LEAVES | 56 | |||
23240 | SET OF 4 KNICK KNACK TINS DOILY | SET OF 4 KNICK KNACK TINS DOILY | 370 | |
SET OF 4 KNICK KNACK TINS DOILEY | 190 | |||
SET OF 4 KNICK KNACK TINS DOILEY | 1 | |||
spaceboy | 22416 | SET OF 36 DOILIES SPACEBOY DESIGN | SET OF 36 DOILIES SPACEBOY DESIGN | 68 |
SET OF 36 SPACEBOY PAPER DOILIES | 9 | |||
t-light, t-lights | 23359 | SET OF 12 T-LIGHTS VINTAGE DOILY | SET OF 12 T-LIGHTS VINTAGE DOILY | 73 |
SET OF 12 T-LIGHTS VINTAGE DOILEY | 6 | |||
s/4 | 85184C | S/4 VALENTINE DECOUPAGE HEART BOX | S/4 VALENTINE DECOUPAGE HEART BOX | 131 |
SET 4 VALENTINE DECOUPAGE HEART BOX | 63 | |||
retrospot | 22602 | RETROSPOT WOODEN HEART DECORATION | RETROSPOT WOODEN HEART DECORATION | 254 |
CHRISTMAS RETROSPOT HEART WOOD | 28 | |||
polkadot, childrens | 20681 | PINK POLKADOT CHILDRENS UMBRELLA | PINK POLKADOT CHILDRENS UMBRELLA | 48 |
MIA | 1 | |||
b'fly, c/cover | 84906 | PINK B'FLY C/COVER W BOBBLES | PINK B'FLY C/COVER W BOBBLES | 7 |
PINK BUTTERFLY CUSHION COVER | 6 | |||
polkadot | 21243 | PINK POLKADOT PLATE | PINK POLKADOT PLATE | 186 |
PINK POLKADOT PLATE | 25 | |||
pannetone | 22584 | PACK OF 6 PANNETONE GIFT BOXES | PACK OF 6 PANNETONE GIFT BOXES | 180 |
PACK OF 6 PANETTONE GIFT BOXES | 19 | |||
22812 | PACK 3 BOXES CHRISTMAS PANNETONE | PACK 3 BOXES CHRISTMAS PANNETONE | 167 | |
PACK 3 BOXES CHRISTMAS PANETTONE | 25 | |||
22813 | PACK 3 BOXES BIRD PANNETONE | PACK 3 BOXES BIRD PANNETONE | 187 | |
PACK 3 BOXES BIRD PANETTONE | 31 | |||
marie, antoinette | 23071 | MARIE ANTOINETTE TRINKET BOX GOLD | MARIE ANTOINETTE TRINKET BOX GOLD | 21 |
MARIE ANTOIENETT TRINKET BOX GOLD | 1 | |||
suki | 22383 | LUNCH BAG SUKI DESIGN | LUNCH BAG SUKI DESIGN | 1117 |
LUNCH BAG SUKI DESIGN | 207 | |||
retrospot | 20725 | LUNCH BAG RED RETROSPOT | LUNCH BAG RED RETROSPOT | 1612 |
LUNCH BAG RED SPOTTY | 1 | |||
jardin, botanique | 23396 | LE JARDIN BOTANIQUE CUSHION COVER | LE JARDIN BOTANIQUE CUSHION COVER | 171 |
LA JARDIN BOTANIQUE CUSHION COVER | 28 | |||
BUTTERFLY CUSHION COVER | 2 | |||
jardin | 85144 | JARDIN ETCHED GLASS CHEESE DISH | JARDIN ETCHED GLASS CHEESE DISH | 41 |
JARDIN ETCHED GLASS BUTTER DISH | 1 | |||
t-light | 23484 | HEART TRELLIS TRIPLE T-LIGHT HOLDER | HEART TRELLIS TRIPLE T-LIGHT HOLDER | 31 |
HEART TRELLISTRIPLE T-LIGHT HOLDER | 5 | |||
71459 | HANGING JAM JAR T-LIGHT HOLDER | HANGING JAM JAR T-LIGHT HOLDER | 356 | |
HANGING JAM JAR T-LIGHT HOLDERS | 93 | |||
retrospot | 22632 | HAND WARMER RED RETROSPOT | HAND WARMER RED RETROSPOT | 387 |
HAND WARMER RED POLKA DOT | 18 | |||
+ | 21175 | GIN + TONIC DIET METAL SIGN | GIN + TONIC DIET METAL SIGN | 766 |
GIN AND TONIC DIET METAL SIGN | 50 | |||
retrospot | 22199 | FRYING PAN RED RETROSPOT | FRYING PAN RED RETROSPOT | 166 |
FRYING PAN RED POLKADOT | 1 | |||
t-light | 23056 | FLOWERS CHANDELIER T-LIGHT HOLDER | FLOWERS CHANDELIER T-LIGHT HOLDER | 41 |
CRYSTAL CHANDELIER T-LIGHT HOLDER | 1 | |||
w/sucker | 81950V | FLOWER PURPLE CLOCK W/SUCKER | FLOWER PURPLE CLOCK W/SUCKER | 3 |
FLOWER PURPLE CLOCK WITH SUCKER | 3 | |||
fairy,5, b'draw | 17107D | FLOWER FAIRY,5 SUMMER B'DRAW LINERS | FLOWER FAIRY,5 SUMMER B'DRAW LINERS | 25 |
FLOWER FAIRY 5 DRAWER LINERS | 20 | |||
FLOWER FAIRY 5 SUMMER DRAW LINERS | 1 | |||
feltcraft, nicole | 23127 | FELTCRAFT GIRL NICOLE KIT | FELTCRAFT GIRL NICOLE KIT | 186 |
DOLLCRAFT GIRL NICOLE | 12 | |||
feltcraft, amelie | 23126 | FELTCRAFT GIRL AMELIE KIT | FELTCRAFT GIRL AMELIE KIT | 281 |
DOLLCRAFT GIRL AMELIE | 8 | |||
DOLLCRAFT GIRL AMELIE KIT | 8 | |||
jean-paul, feltcraft | 23128 | FELTCRAFT BOY JEAN-PAUL KIT | FELTCRAFT BOY JEAN-PAUL KIT | 127 |
DOLLCRAFT BOY JEAN-PAUL | 12 | |||
t-light | 71477 | COLOUR GLASS. STAR T-LIGHT HOLDER | COLOUR GLASS. STAR T-LIGHT HOLDER | 286 |
COLOURED GLASS STAR T-LIGHT HOLDER | 59 | |||
childrens, spaceboy | 23256 | CHILDRENS CUTLERY SPACEBOY | CHILDRENS CUTLERY SPACEBOY | 367 |
KIDS CUTLERY SPACEBOY | 8 | |||
childrens, retrospot | 84997B | CHILDRENS CUTLERY RETROSPOT RED | CHILDRENS CUTLERY RETROSPOT RED | 266 |
RED 3 PIECE RETROSPOT CUTLERY SET | 109 | |||
84997b | CHILDRENS CUTLERY RETROSPOT RED | CHILDRENS CUTLERY RETROSPOT RED | 58 | |
RED 3 PIECE RETROSPOT CUTLERY SET | 12 | |||
polkadot, childrens | 84997D | CHILDRENS CUTLERY POLKADOT PINK | CHILDRENS CUTLERY POLKADOT PINK | 362 |
PINK 3 PIECE POLKADOT CUTLERY SET | 116 | |||
84997d | CHILDRENS CUTLERY POLKADOT PINK | CHILDRENS CUTLERY POLKADOT PINK | 67 | |
PINK 3 PIECE POLKADOT CUTLERY SET | 8 | |||
84997A | CHILDRENS CUTLERY POLKADOT GREEN | CHILDRENS CUTLERY POLKADOT GREEN | 189 | |
GREEN 3 PIECE POLKADOT CUTLERY SET | 74 | |||
84997a | CHILDRENS CUTLERY POLKADOT GREEN | CHILDRENS CUTLERY POLKADOT GREEN | 60 | |
GREEN 3 PIECE POLKADOT CUTLERY SET | 5 | |||
84997C | CHILDRENS CUTLERY POLKADOT BLUE | CHILDRENS CUTLERY POLKADOT BLUE | 235 | |
BLUE 3 PIECE POLKADOT CUTLERY SET | 102 | |||
84997c | CHILDRENS CUTLERY POLKADOT BLUE | CHILDRENS CUTLERY POLKADOT BLUE | 60 | |
BLUE 3 PIECE POLKADOT CUTLERY SET | 6 | |||
childrens | 23254 | CHILDRENS CUTLERY DOLLY GIRL | CHILDRENS CUTLERY DOLLY GIRL | 296 |
KIDS CUTLERY DOLLY GIRL | 8 | |||
spaceboy, children's | 22972 | CHILDREN'S SPACEBOY MUG | CHILDREN'S SPACEBOY MUG | 235 |
CHILDRENS SPACEBOY MUG | 2 | |||
candleholder | 22804 | CANDLEHOLDER PINK HANGING HEART | CANDLEHOLDER PINK HANGING HEART | 408 |
PINK HANGING HEART T-LIGHT HOLDER | 78 | |||
t-light | 23057 | BEADED CHANDELIER T-LIGHT HOLDER | BEADED CHANDELIER T-LIGHT HOLDER | 39 |
GEMSTONE CHANDELIER T-LIGHT HOLDER | 7 | |||
crawlies | 21830 | ASSORTED CREEPY CRAWLIES | ASSORTED CREEPY CRAWLIES | 101 |
MERCHANT CHANDLER CREDIT ERROR, STO | 1 | |||
t-light | 84946 | ANTIQUE SILVER T-LIGHT GLASS | ANTIQUE SILVER T-LIGHT GLASS | 711 |
ANTIQUE SILVER TEA GLASS ETCHED | 223 | |||
snowflake,pink | 35817P | ACRYLIC JEWEL SNOWFLAKE,PINK | ACRYLIC JEWEL SNOWFLAKE,PINK | 1 |
PINK ACRYLIC JEWEL SNOWFLAKE | 1 | |||
50's | 23437 | 50'S CHRISTMAS GIFT BAG LARGE | 50'S CHRISTMAS GIFT BAG LARGE | 130 |
GIFT BAG LARGE 50'S CHRISTMAS | 2 |
134
# checking descriptions of similar stock codes
filter_mask = ('stock_code == "84997A" or stock_code == "84997a" \
                or stock_code == "84997B" or stock_code == "84997b" \
                or stock_code == "84997D" or stock_code == "84997d"')

df_ecom_filtered.query(filter_mask).groupby(['stock_code', 'description'])\
    .agg({'unit_price': ['mean', 'std']})

df_ecom_filtered.query(filter_mask).groupby(['stock_code', 'description'])\
    .agg({'unit_price': ['mean', 'std']}).reset_index().sort_values(by='description')
unit_price | |||
---|---|---|---|
mean | std | ||
stock_code | description | ||
84997A | CHILDRENS CUTLERY POLKADOT GREEN | 4.60 | 1.35 |
GREEN 3 PIECE POLKADOT CUTLERY SET | 4.07 | 0.96 | |
84997B | CHILDRENS CUTLERY RETROSPOT RED | 4.49 | 1.21 |
RED 3 PIECE RETROSPOT CUTLERY SET | 4.07 | 1.00 | |
84997D | CHILDRENS CUTLERY POLKADOT PINK | 4.50 | 1.21 |
PINK 3 PIECE POLKADOT CUTLERY SET | 4.10 | 1.05 | |
84997a | CHILDRENS CUTLERY POLKADOT GREEN | 8.29 | 0.00 |
GREEN 3 PIECE POLKADOT CUTLERY SET | 8.29 | 0.00 | |
84997b | CHILDRENS CUTLERY RETROSPOT RED | 8.62 | 1.24 |
RED 3 PIECE RETROSPOT CUTLERY SET | 8.38 | 0.09 | |
84997d | CHILDRENS CUTLERY POLKADOT PINK | 8.43 | 0.83 |
PINK 3 PIECE POLKADOT CUTLERY SET | 8.36 | 0.09 |
stock_code | description | unit_price | ||
---|---|---|---|---|
mean | std | |||
0 | 84997A | CHILDRENS CUTLERY POLKADOT GREEN | 4.60 | 1.35 |
6 | 84997a | CHILDRENS CUTLERY POLKADOT GREEN | 8.29 | 0.00 |
4 | 84997D | CHILDRENS CUTLERY POLKADOT PINK | 4.50 | 1.21 |
10 | 84997d | CHILDRENS CUTLERY POLKADOT PINK | 8.43 | 0.83 |
2 | 84997B | CHILDRENS CUTLERY RETROSPOT RED | 4.49 | 1.21 |
8 | 84997b | CHILDRENS CUTLERY RETROSPOT RED | 8.62 | 1.24 |
1 | 84997A | GREEN 3 PIECE POLKADOT CUTLERY SET | 4.07 | 0.96 |
7 | 84997a | GREEN 3 PIECE POLKADOT CUTLERY SET | 8.29 | 0.00 |
5 | 84997D | PINK 3 PIECE POLKADOT CUTLERY SET | 4.10 | 1.05 |
11 | 84997d | PINK 3 PIECE POLKADOT CUTLERY SET | 8.36 | 0.09 |
3 | 84997B | RED 3 PIECE RETROSPOT CUTLERY SET | 4.07 | 1.00 |
9 | 84997b | RED 3 PIECE RETROSPOT CUTLERY SET | 8.38 | 0.09 |
Observations
Decisions
Using the `.map()` method, we will create the `standardized_description_fixed` column with the most correct descriptions.
Note 1: We observed several cases where the same descriptions are represented by very similar stock codes, differing only in the case of one letter (e.g., “A” vs. “a” and “D” vs. “d”). We could unite such descriptions and stock codes, but we will not, since this is not an isolated case and we lack information about the naming convention. Furthermore, we checked that the average unit prices of such similar stock codes differ considerably (roughly by a factor of two), which supports our decision not to unite them. At the same time, products sharing the same stock code with similar descriptions (most likely differing in package size or amount) show very similar mean prices. Nevertheless, it seems safer to keep them distinguished.
Note 2: We discovered two more types of manual corrections among the descriptions: “MERCHANT CHANDLER CREDIT ERROR, STO” and “MIA”. They were not caught earlier because they are written in uppercase like normal product names, whereas the manual corrections we saw before were written in lowercase. Such corrections represent a negligible amount of data, so they are not worth the effort to address.
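As a quick sanity check (an illustrative sketch, not part of the main pipeline), we could confirm how negligible these two manual-correction descriptions really are; the two strings below are the ones spotted in the summary table above.

# counting entries carrying the uppercase manual-correction descriptions (illustrative sketch)
manual_corrections = ['MERCHANT CHANDLER CREDIT ERROR, STO', 'MIA']
manual_correction_entries = df_ecom_filtered.query('description in @manual_corrections')
print(f'Manual-correction entries: {len(manual_correction_entries)} '
      f'({len(manual_correction_entries) / len(df_ecom_filtered) * 100:.3f}% of all entries)')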
# creating a dictionary to address mistakes in descriptions or suboptimal choices of "standard" descriptions
description_correction = {'VIPPASSPORT COVER': 'VIP PASSPORT COVER',
                          'SQUARECUSHION COVER PINK UNION JACK': 'SQUARE CUSHION COVER PINK UNION JACK',
                          'WOOD S/3 CABINET ANT WHITE FINISH': '3 DRAWER ANTIQUE WHITE WOOD CABINET',
                          'S/4 VALENTINE DECOUPAGE HEART BOX': 'SET 4 VALENTINE DECOUPAGE HEART BOX',
                          'FLOWER PURPLE CLOCK W/SUCKER': 'FLOWER PURPLE CLOCK WITH SUCKER'}

# correcting the descriptions in the list of most frequent (standard) descriptions
most_frequent_descriptions_fixed = most_frequent_descriptions.map(lambda descr: description_correction.get(descr.strip(), descr.strip()))  # cleaning spaces from the beginning and end of a description (it appeared, for instance, that 'VIPPASSPORT COVER' is in fact 'VIPPASSPORT COVER ' - with an extra space at the end)
most_frequent_descriptions_fixed
# creating a list of descriptions that shouldn't be changed to most frequent (standard) descriptions
white_descriptions = ['CREAM HANGING HEART T-LIGHT HOLDER', 'GREEN 3 PIECE POLKADOT CUTLERY SET', 'BLUE 3 PIECE POLKADOT CUTLERY SET', 'PINK 3 PIECE POLKADOT CUTLERY SET']

exceptions_data_summary = exceptions_data_summary.reset_index()
exceptions_data_summary = exceptions_data_summary[['stock_code', 'description', 'count', 'standardized_description', 'mistake']]  # changing the order of columns for consistency

exceptions_data_summary['standardized_description_fixed'] = (
    exceptions_data_summary.apply(lambda row:
        # cleaning spaces from the beginning and end of a description
        row['description'].strip() if row['description'].strip() in white_descriptions
        else
        # replacing a description if it's present in "description_correction"; if it's not present - keeping it unchanged
        description_correction.get(row['standardized_description'].strip(), row['standardized_description'].strip()),
        axis=1))

# checking the result
description_correction_values = list(description_correction.values())

print('\033[1mAll the entries with updated standardized descriptions:\033[0m')
exceptions_data_summary.query('standardized_description_fixed in @description_correction_values')
print('\n\033[1mRandom entries with NOT updated standardized descriptions:\033[0m')
exceptions_data_summary.query('standardized_description_fixed not in @description_correction_values').sample(3)
print('\n\033[1mAll the entries with descriptions from the "white list" (keeping original descriptions):\033[0m')
exceptions_data_summary.query('description in @white_descriptions')
All the entries with updated standardized descriptions:
stock_code | description | count | standardized_description | mistake | standardized_description_fixed | |
---|---|---|---|---|---|---|
7 | 82486 | WOOD S/3 CABINET ANT WHITE FINISH | 414 | WOOD S/3 CABINET ANT WHITE FINISH | s/3 | 3 DRAWER ANTIQUE WHITE WOOD CABINET |
8 | 82486 | 3 DRAWER ANTIQUE WHITE WOOD CABINET | 205 | WOOD S/3 CABINET ANT WHITE FINISH | s/3 | 3 DRAWER ANTIQUE WHITE WOOD CABINET |
19 | 20622 | VIPPASSPORT COVER | 34 | VIPPASSPORT COVER | vippassport | VIP PASSPORT COVER |
20 | 20622 | VIP PASSPORT COVER | 17 | VIPPASSPORT COVER | vippassport | VIP PASSPORT COVER |
24 | 22785 | SQUARECUSHION COVER PINK UNION JACK | 42 | SQUARECUSHION COVER PINK UNION JACK | squarecushion | SQUARE CUSHION COVER PINK UNION JACK |
25 | 22785 | SQUARECUSHION COVER PINK UNION FLAG | 32 | SQUARECUSHION COVER PINK UNION JACK | squarecushion | SQUARE CUSHION COVER PINK UNION JACK |
47 | 85184C | S/4 VALENTINE DECOUPAGE HEART BOX | 131 | S/4 VALENTINE DECOUPAGE HEART BOX | s/4 | SET 4 VALENTINE DECOUPAGE HEART BOX |
48 | 85184C | SET 4 VALENTINE DECOUPAGE HEART BOX | 63 | S/4 VALENTINE DECOUPAGE HEART BOX | s/4 | SET 4 VALENTINE DECOUPAGE HEART BOX |
86 | 81950V | FLOWER PURPLE CLOCK W/SUCKER | 3 | FLOWER PURPLE CLOCK W/SUCKER | w/sucker | FLOWER PURPLE CLOCK WITH SUCKER |
87 | 81950V | FLOWER PURPLE CLOCK WITH SUCKER | 3 | FLOWER PURPLE CLOCK W/SUCKER | w/sucker | FLOWER PURPLE CLOCK WITH SUCKER |
Random entries with NOT updated standardized descriptions:
stock_code | description | count | standardized_description | mistake | standardized_description_fixed | |
---|---|---|---|---|---|---|
23 | 22776 | CAKESTAND, 3 TIER, LOVEHEART | 1 | SWEETHEART CAKESTAND 3 TIER | cakestand | SWEETHEART CAKESTAND 3 TIER |
113 | 84997a | GREEN 3 PIECE POLKADOT CUTLERY SET | 5 | CHILDRENS CUTLERY POLKADOT GREEN | polkadot, childrens | GREEN 3 PIECE POLKADOT CUTLERY SET |
93 | 23126 | FELTCRAFT GIRL AMELIE KIT | 281 | FELTCRAFT GIRL AMELIE KIT | feltcraft, amelie | FELTCRAFT GIRL AMELIE KIT |
All the entries with descriptions from the "white list" (keeping original descriptions):
stock_code | description | count | standardized_description | mistake | standardized_description_fixed | |
---|---|---|---|---|---|---|
10 | 85123A | CREAM HANGING HEART T-LIGHT HOLDER | 9 | WHITE HANGING HEART T-LIGHT HOLDER | t-light | CREAM HANGING HEART T-LIGHT HOLDER |
107 | 84997D | PINK 3 PIECE POLKADOT CUTLERY SET | 116 | CHILDRENS CUTLERY POLKADOT PINK | polkadot, childrens | PINK 3 PIECE POLKADOT CUTLERY SET |
109 | 84997d | PINK 3 PIECE POLKADOT CUTLERY SET | 8 | CHILDRENS CUTLERY POLKADOT PINK | polkadot, childrens | PINK 3 PIECE POLKADOT CUTLERY SET |
111 | 84997A | GREEN 3 PIECE POLKADOT CUTLERY SET | 74 | CHILDRENS CUTLERY POLKADOT GREEN | polkadot, childrens | GREEN 3 PIECE POLKADOT CUTLERY SET |
113 | 84997a | GREEN 3 PIECE POLKADOT CUTLERY SET | 5 | CHILDRENS CUTLERY POLKADOT GREEN | polkadot, childrens | GREEN 3 PIECE POLKADOT CUTLERY SET |
115 | 84997C | BLUE 3 PIECE POLKADOT CUTLERY SET | 102 | CHILDRENS CUTLERY POLKADOT BLUE | polkadot, childrens | BLUE 3 PIECE POLKADOT CUTLERY SET |
117 | 84997c | BLUE 3 PIECE POLKADOT CUTLERY SET | 6 | CHILDRENS CUTLERY POLKADOT BLUE | polkadot, childrens | BLUE 3 PIECE POLKADOT CUTLERY SET |
# creating a DataFrame of descriptions and their related standard descriptions that have been fixed (for cases of stock codes having multiple descriptions)
fixed_descriptions = exceptions_data_summary[['description', 'standardized_description_fixed']]
fixed_descriptions
description | standardized_description_fixed | |
---|---|---|
0 | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE |
1 | ZINC T-LIGHT HOLDER STARS LARGE | ZINC T-LIGHT HOLDER STAR LARGE |
2 | ZINC STAR T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER |
3 | ZINC STAR T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER |
4 | WRAP DOILEY DESIGN | WRAP DOILEY DESIGN |
... | ... | ... |
129 | ANTIQUE SILVER TEA GLASS ETCHED | ANTIQUE SILVER T-LIGHT GLASS |
130 | ACRYLIC JEWEL SNOWFLAKE,PINK | ACRYLIC JEWEL SNOWFLAKE,PINK |
131 | PINK ACRYLIC JEWEL SNOWFLAKE | ACRYLIC JEWEL SNOWFLAKE,PINK |
132 | 50'S CHRISTMAS GIFT BAG LARGE | 50'S CHRISTMAS GIFT BAG LARGE |
133 | GIFT BAG LARGE 50'S CHRISTMAS | 50'S CHRISTMAS GIFT BAG LARGE |
134 rows × 2 columns
# creating a DataFrame of descriptions and their related standard descriptions - the full list (for cases of stock codes having multiple descriptions)
full_multiple_descriptions = stock_codes_multiple_descriptions_summary_filtered[['description', 'standardized_description']]
full_multiple_descriptions
description | standardized_description | |
---|---|---|
0 | WRAP CAROUSEL | WRAP CAROUSEL |
1 | WRAP, CAROUSEL | WRAP CAROUSEL |
2 | FLOWER FAIRY,5 SUMMER B'DRAW LINERS | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
3 | FLOWER FAIRY 5 DRAWER LINERS | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
4 | FLOWER FAIRY 5 SUMMER DRAW LINERS | FLOWER FAIRY,5 SUMMER B'DRAW LINERS |
... | ... | ... |
472 | SILVER M.O.P. ORBIT NECKLACE | SILVER/MOP ORBIT NECKLACE |
473 | GOLD M PEARL ORBIT NECKLACE | GOLD M PEARL ORBIT NECKLACE |
474 | GOLD M.O.P. ORBIT NECKLACE | GOLD M PEARL ORBIT NECKLACE |
475 | SILVER AND BLACK ORBIT NECKLACE | SILVER AND BLACK ORBIT NECKLACE |
476 | SILVER/BLACK ORBIT NECKLACE | SILVER AND BLACK ORBIT NECKLACE |
477 rows × 2 columns
# merging the DataFrames
multiple_descriptions_merged = full_multiple_descriptions.merge(fixed_descriptions, on='description', how='outer', indicator=True)  # adding a column indicating the source of each row

# checking the result
multiple_descriptions_merged
multiple_descriptions_merged['_merge'].value_counts()
description | standardized_description | standardized_description_fixed | _merge | |
---|---|---|---|---|
0 | 50'S CHRISTMAS GIFT BAG LARGE | 50'S CHRISTMAS GIFT BAG LARGE | 50'S CHRISTMAS GIFT BAG LARGE | both |
1 | I LOVE LONDON MINI BACKPACK | I LOVE LONDON MINI BACKPACK | NaN | left_only |
2 | I LOVE LONDON MINI RUCKSACK | I LOVE LONDON MINI BACKPACK | NaN | left_only |
3 | RED SPOT GIFT BAG LARGE | RED SPOT GIFT BAG LARGE | NaN | left_only |
4 | SET 2 TEA TOWELS I LOVE LONDON | SET 2 TEA TOWELS I LOVE LONDON | NaN | left_only |
... | ... | ... | ... | ... |
488 | ZINC HERB GARDEN CONTAINER | ZINC HERB GARDEN CONTAINER | NaN | left_only |
489 | ZINC PLANT POT HOLDER | ZINC HEARTS PLANT POT HOLDER | NaN | left_only |
490 | ZINC STAR T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER | both |
491 | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE | both |
492 | ZINC T-LIGHT HOLDER STARS LARGE | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE | both |
493 rows × 4 columns
_merge
left_only 343
both 150
right_only 0
Name: count, dtype: int64
# adding the `standardized_description_final` column
multiple_descriptions_merged['standardized_description_final'] = (
    multiple_descriptions_merged['standardized_description_fixed'].where(multiple_descriptions_merged['_merge'] == "both",  # keeping the value of the `standardized_description_fixed` column, if it exists
                                                                          multiple_descriptions_merged['standardized_description']))  # otherwise keeping the value of the `standardized_description` column

multiple_descriptions_merged
description | standardized_description | standardized_description_fixed | _merge | standardized_description_final | |
---|---|---|---|---|---|
0 | 50'S CHRISTMAS GIFT BAG LARGE | 50'S CHRISTMAS GIFT BAG LARGE | 50'S CHRISTMAS GIFT BAG LARGE | both | 50'S CHRISTMAS GIFT BAG LARGE |
1 | I LOVE LONDON MINI BACKPACK | I LOVE LONDON MINI BACKPACK | NaN | left_only | I LOVE LONDON MINI BACKPACK |
2 | I LOVE LONDON MINI RUCKSACK | I LOVE LONDON MINI BACKPACK | NaN | left_only | I LOVE LONDON MINI BACKPACK |
3 | RED SPOT GIFT BAG LARGE | RED SPOT GIFT BAG LARGE | NaN | left_only | RED SPOT GIFT BAG LARGE |
4 | SET 2 TEA TOWELS I LOVE LONDON | SET 2 TEA TOWELS I LOVE LONDON | NaN | left_only | SET 2 TEA TOWELS I LOVE LONDON |
... | ... | ... | ... | ... | ... |
488 | ZINC HERB GARDEN CONTAINER | ZINC HERB GARDEN CONTAINER | NaN | left_only | ZINC HERB GARDEN CONTAINER |
489 | ZINC PLANT POT HOLDER | ZINC HEARTS PLANT POT HOLDER | NaN | left_only | ZINC HEARTS PLANT POT HOLDER |
490 | ZINC STAR T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER | both | ZINC STAR T-LIGHT HOLDER |
491 | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE | both | ZINC T-LIGHT HOLDER STAR LARGE |
492 | ZINC T-LIGHT HOLDER STARS LARGE | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STAR LARGE | both | ZINC T-LIGHT HOLDER STAR LARGE |
493 rows × 5 columns
# creating a dictionary of original descriptions and their final version to be maintained
multiple_descriptions_merged_dict = (multiple_descriptions_merged[['description', 'standardized_description_final']].set_index('description')
                                     ['standardized_description_final'].to_dict())

# adding the `standardized_description_final` column to `df_ecom_filtered` by mapping descriptions to their standardized versions if available, otherwise keeping the original description
df_ecom_filtered['standardized_description_final'] = df_ecom_filtered['description'].map(lambda descr: multiple_descriptions_merged_dict.get(descr, descr))
# checking some of the addressed descriptions
print(f'\033[1mExamples of stock codes and descriptions that are supposed to be modified:\033[0m')
df_ecom_filtered.query('stock_code == "20725"').groupby(['stock_code', 'standardized_description_final'])['description'].value_counts()
df_ecom_filtered.query('stock_code == "20622"').groupby(['stock_code', 'standardized_description_final'])['description'].value_counts()

print(f'\n\033[1mExamples of stock codes and descriptions that are supposed to stay unchanged:\033[0m')
df_ecom_filtered.query('stock_code == "85123A"').groupby(['stock_code', 'standardized_description_final'])['description'].value_counts()
df_ecom_filtered.query('stock_code == "84997A"').groupby(['stock_code', 'standardized_description_final'])['description'].value_counts()
Examples of stock codes and descriptions that are supposed to be modified:
stock_code standardized_description_final description
20725 LUNCH BAG RED RETROSPOT LUNCH BAG RED RETROSPOT 1612
LUNCH BAG RED SPOTTY 1
Name: count, dtype: int64
stock_code standardized_description_final description
20622 VIP PASSPORT COVER VIPPASSPORT COVER 34
VIP PASSPORT COVER 17
Name: count, dtype: int64
Examples of stock codes and descriptions that are supposed to stay unchanged:
stock_code standardized_description_final description
85123A CREAM HANGING HEART T-LIGHT HOLDER CREAM HANGING HEART T-LIGHT HOLDER 9
WHITE HANGING HEART T-LIGHT HOLDER WHITE HANGING HEART T-LIGHT HOLDER 2278
Name: count, dtype: int64
stock_code standardized_description_final description
84997A CHILDRENS CUTLERY POLKADOT GREEN CHILDRENS CUTLERY POLKADOT GREEN 189
GREEN 3 PIECE POLKADOT CUTLERY SET GREEN 3 PIECE POLKADOT CUTLERY SET 74
Name: count, dtype: int64
# creating a DataFrame of stock codes associated with numerous descriptions - based on already addressed descriptions
stock_codes_multiple_descriptions_fixed = (df_ecom_filtered.groupby('stock_code')['standardized_description_final'].nunique()
                                           .reset_index()
                                           .sort_values(by='standardized_description_final')
                                           .query('standardized_description_final > 1'))

# checking the result
initial_number_stock_codes = len(stock_codes_multiple_descriptions)
revised_number_stock_codes = len(stock_codes_multiple_descriptions_fixed)
share_remaining = 1 - (initial_number_stock_codes - revised_number_stock_codes) / initial_number_stock_codes
stock_codes_remaining = list(stock_codes_multiple_descriptions_fixed['stock_code'].unique())

# display(Markdown(f'**Stock codes associated with numerous descriptions**'))
print("="*130)
print(f'\033[1mStock codes associated with numerous descriptions: ')
print(f'\033[1m - Initial number:\033[0m {len(stock_codes_multiple_descriptions)}')
print(f'\033[1m - Number and remaining share after revision:\033[0m {len(stock_codes_multiple_descriptions_fixed)} ({share_remaining * 100 :0.1f}%)')
print(f'\033[1m - Stock codes remaining after revision:\033[0m {stock_codes_remaining})')
print("="*130)
==================================================================================================================================
Stock codes associated with numerous descriptions:
- Initial number: 642
- Number and remaining share after revision: 9 (1.4%)
- Stock codes remaining after revision: ['84997A', '23235', '85123A', '84997d', '84997c', '84997a', '84997C', '23040', '84997D'])
==================================================================================================================================
# creating a DataFrame of entries associated with remaining stock codes with numerous descriptions
stock_codes_multiple_descriptions_fixed = df_ecom_filtered.query('stock_code in @stock_codes_remaining').sort_values(by='stock_code')

# checking the share of entries associated with remaining stock codes with numerous descriptions
share_evaluation(stock_codes_multiple_descriptions_fixed, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code', 'nunique'): 'Stock Codes Coverage'},
                 show_pie_charts_notes=True,
                 show_example=True, example_type='sample', random_state=11, example_limit=3)
======================================================================================================================================================
Evaluation of share: stock_codes_multiple_descriptions_fixed
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 3725 (0.7% of all entries)
Quantity: 50711 (0.9% of the total quantity)
Revenue: 157597.8 (1.6% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Notes on the pie charts:
- Quantity Share: which part of the total quantity of df_ecom_filtered falls into stock_codes_multiple_descriptions_fixed.
- Revenue Share: which part of the total revenue of df_ecom_filtered is generated in stock_codes_multiple_descriptions_fixed.
- Entries Share: which part of all entries of df_ecom_filtered occurs in stock_codes_multiple_descriptions_fixed. Every entry is counted separately, even if they are associated with the same order.
- Invoices Coverage: if even one entry of an invoice falls into stock_codes_multiple_descriptions_fixed, it still counts as one full unique order in this chart.
- Stock Codes Coverage: if even one entry of a stock code falls into stock_codes_multiple_descriptions_fixed, it still counts as one full unique stock code in this chart.
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
541531 581498 84997d CHILDRENS CUTLERY POLKADOT PINK 1 2019-12-07 10:26:00 8.29 0 2019 12
432726 573889 85123A WHITE HANGING HEART T-LIGHT HOLDER 2 2019-10-30 13:44:00 2.95 13571 2019 10
248446 558835 84997a CHILDRENS CUTLERY POLKADOT GREEN 1 2019-07-02 11:58:00 8.29 0 2019 7
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
541531 2019-12 49 2019-Week-49 2019-12-07 5 Saturday 8.29
432726 2019-10 44 2019-Week-44 2019-10-30 2 Wednesday 5.90
248446 2019-07 27 2019-Week-27 2019-07-02 1 Tuesday 8.29
standardized_description_final
541531 CHILDRENS CUTLERY POLKADOT PINK
432726 WHITE HANGING HEART T-LIGHT HOLDER
248446 CHILDRENS CUTLERY POLKADOT GREEN
======================================================================================================================================================
For comparison, let’s recall the share of such entries prior to the current revision.
share_evaluation(stock_codes_multiple_descriptions_filtered, df_ecom_filtered, show_qty_rev=True, show_example=False)
======================================================================================================================================================
Evaluation of share: stock_codes_multiple_descriptions_filtered
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 50044 (9.5% of all entries)
Quantity: 562865 (10.5% of the total quantity)
Revenue: 1199770.2 (12.1% of the total revenue)
======================================================================================================================================================
To simplify naming, we will rename the long `standardized_description_final` column to just `description` and use it for further study. At the same time, to preserve the original product names, we will keep the original descriptions under the `initial_description` column.
# renaming columns
if 'initial_description' not in df_ecom_filtered.columns:  # checking if the renaming has already been performed - to avoid a harmful cell multi-run issue
    df_ecom_filtered = df_ecom_filtered.copy()
    df_ecom_filtered = df_ecom_filtered.rename(columns={
        'description': 'initial_description',
        'standardized_description_final': 'description'})

# checking the result
df_ecom_filtered.columns
Index(['invoice_no', 'stock_code', 'initial_description', 'quantity', 'invoice_date', 'unit_price', 'customer_id', 'invoice_year', 'invoice_month',
'invoice_year_month', 'invoice_week', 'invoice_year_week', 'invoice_day', 'invoice_day_of_week', 'invoice_day_name', 'revenue',
'description'],
dtype='object')
Following our review of stock codes with numerous descriptions, let’s check the remaining descriptions associated with numerous stock codes.
During the previous step, while cleaning suspicious descriptions, we ran into a case where a trailing space created an unobvious duplicate of a description (“VIPPASSPORT COVER” vs “VIPPASSPORT COVER ” with an extra space at the end). We will now check all descriptions to ensure no such instances remain, looking for unwanted spaces not only at the edges of the text but also for double spaces in the middle.
# identifying descriptions with spacing issues
with_spacing_issues = (df_ecom_filtered['description']
                       .str.contains(r'(^\s+)|(\s+$)|(\s{2,})', regex=True))  # checking for spaces at the beginning, at the end, or 2 and more consecutive spaces within the text

spacing_issues_number = df_ecom_filtered[with_spacing_issues]['description'].nunique()
spacing_issues_examples = list(df_ecom_filtered[with_spacing_issues]['description'].unique()[:10])
descriptions_initial_number = df_ecom_filtered['description'].nunique()

# normalizing descriptions by removing unnecessary spacing
df_ecom_filtered['description'] = df_ecom_filtered['description'].str.strip()  # removing unnecessary spaces at the edges of strings
df_ecom_filtered['description'] = df_ecom_filtered['description'].str.replace(r'\s+', ' ', regex=True)  # replacing multiple consecutive spaces within strings with a single space

# checking the result
with_spacing_issues_count = with_spacing_issues.sum()
descriptions_filtered_number = df_ecom_filtered['description'].nunique()
addressed_duplicates = descriptions_initial_number - descriptions_filtered_number

print('='*table_width)
print(f'\033[1mTotal number of unique descriptions:\033[0m {descriptions_initial_number}')
print(f'\033[1mNumber of descriptions with spacing issues:\033[0m {spacing_issues_number}')
print(f'\033[1mExamples of descriptions with spacing issues:\033[0m')
print(spacing_issues_examples)
print('-'*table_width)
print(f'\033[1mTotal number of unique descriptions after filtering:\033[0m {descriptions_filtered_number} ({addressed_duplicates} unobvious description duplicates addressed)')
print('='*table_width)
======================================================================================================================================================
Total number of unique descriptions: 3808
Number of descriptions with spacing issues: 809
Examples of descriptions with spacing issues:
["POPPY'S PLAYHOUSE BEDROOM ", 'IVORY KNITTED MUG COSY ', 'BOX OF VINTAGE JIGSAW BLOCKS ', 'ALARM CLOCK BAKELIKE RED ', 'STARS GIFT TAPE ', 'INFLATABLE POLITICAL GLOBE ', 'VINTAGE HEADS AND TAILS CARD GAME ', 'SET/2 RED RETROSPOT TEA TOWELS ', 'ROUND SNACK BOXES SET OF4 WOODLAND ', 'SPACEBOY LUNCH BOX ']
------------------------------------------------------------------------------------------------------------------------------------------------------
Total number of unique descriptions after filtering: 3798 (10 unobvious description duplicates addressed)
======================================================================================================================================================
# checking remaining descriptions with multiple stock codes
descriptions_multiple_stock_codes_fixed = (
    df_ecom_filtered.groupby('description')['stock_code'].nunique()
    .sort_values(ascending=False)
    .reset_index()
    .query('stock_code > 1'))

descriptions_multiple_stock_codes_fixed
description | stock_code | |
---|---|---|
0 | METAL SIGN,CUPCAKE SINGLE HOOK | 6 |
1 | SET OF 4 FAIRY CAKE PLACEMATS | 4 |
2 | COLUMBIAN CANDLE ROUND | 3 |
3 | DOORMAT BLACK FLOCK | 2 |
4 | CHILDRENS CUTLERY POLKADOT BLUE | 2 |
... | ... | ... |
129 | 3D SHEET OF DOG STICKERS | 2 |
130 | ICON PLACEMAT POP ART ELVIS | 2 |
131 | PINK FAIRY CAKE CHILDRENS APRON | 2 |
132 | ROSE DU SUD CUSHION COVER | 2 |
133 | LUSH GREENS RIBBONS | 2 |
134 rows × 2 columns
# checking descriptions having the largest number of stock codes
description_over_two_stock_codes = descriptions_multiple_stock_codes_fixed.query('stock_code > 2')['description'].to_list()

df_ecom_filtered.query('description in @description_over_two_stock_codes').groupby('description')['stock_code'].value_counts()
description stock_code
COLUMBIAN CANDLE ROUND 72128 36
72127 31
72130 28
METAL SIGN,CUPCAKE SINGLE HOOK 82613B 112
82613C 97
82613A 18
82613b 4
82613c 4
82613a 1
SET OF 4 FAIRY CAKE PLACEMATS 84509B 80
84509G 66
84509b 4
84509g 2
Name: count, dtype: int64
We see that the stock code numbers associated with the same description are generally identical, with only the letter at the end differing. One exception is the “COLUMBIAN CANDLE ROUND” description, where the stock code numbers are very close but not the same.
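As an illustration (a sketch we do not apply to the data), we could strip a trailing letter from these stock codes and count base codes per description; a single base code per description would confirm that only the trailing letter differs, with “COLUMBIAN CANDLE ROUND” remaining the exception.

# counting numeric base codes per description after stripping a trailing letter (illustrative sketch)
base_codes_per_description = (
    df_ecom_filtered.query('description in @description_over_two_stock_codes')
    .assign(base_code=lambda df: df['stock_code'].str.replace(r'[A-Za-z]+$', '', regex=True))
    .groupby('description')['base_code'].nunique())
base_codes_per_description  # 1 base code per description means only the trailing letter differs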
# checking stock codes of remaining descriptions with multiple stock codes
remaining_descriptions = set(descriptions_multiple_stock_codes_fixed['description'])
descriptions_multiple_stock_codes_fixed_summary = (
    df_ecom_filtered.query('description in @remaining_descriptions')
    .groupby(['initial_description', 'description', 'stock_code'])
    .agg({'invoice_no': 'count', 'unit_price': ['mean', 'std']}))

# flattening column names instead of maintaining a MultiIndex
descriptions_multiple_stock_codes_fixed_summary.columns = [
    f'{column[0]}_{column[1]}' if column[1] else column[0]
    for column in descriptions_multiple_stock_codes_fixed_summary.columns]

descriptions_multiple_stock_codes_fixed_summary
invoice_no_count | unit_price_mean | unit_price_std | |||
---|---|---|---|---|---|
initial_description | description | stock_code | |||
3 GARDENIA MORRIS BOXED CANDLES | 3 GARDENIA MORRIS BOXED CANDLES | 85034A | 83 | 2.79 | 2.18 |
85034a | 3 | 8.29 | 0.00 | ||
3 WHITE CHOC MORRIS BOXED CANDLES | 3 WHITE CHOC MORRIS BOXED CANDLES | 85034B | 122 | 2.72 | 2.23 |
85034b | 1 | 8.29 | NaN | ||
3D DOG PICTURE PLAYING CARDS | 3D DOG PICTURE PLAYING CARDS | 84558A | 82 | 3.12 | 0.87 |
... | ... | ... | ... | ... | ... |
WOVEN BUBBLE GUM CUSHION COVER | WOVEN BUBBLE GUM CUSHION COVER | 46776a | 1 | 4.13 | NaN |
WOVEN CANDY CUSHION COVER | WOVEN CANDY CUSHION COVER | 46776E | 38 | 4.24 | 0.28 |
46776e | 1 | 4.13 | NaN | ||
WOVEN ROSE GARDEN CUSHION COVER | WOVEN ROSE GARDEN CUSHION COVER | 46776F | 89 | 4.21 | 0.24 |
46776f | 1 | 4.13 | NaN |
284 rows × 3 columns
# creating a DataFrame of entries associated with remaining descriptions with numerous stock codes
descriptions_multiple_stock_codes_fixed_entries = df_ecom_filtered.query('description in @remaining_descriptions')

# checking the share of entries associated with remaining descriptions with numerous stock codes
share_evaluation(descriptions_multiple_stock_codes_fixed_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('description', 'nunique'): 'Descriptions Coverage'},
                 show_pie_charts_notes=True)
======================================================================================================================================================
Evaluation of share: descriptions_multiple_stock_codes_fixed_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 24577 (4.6% of all entries)
Quantity: 205915 (3.8% of the total quantity)
Revenue: 494960.6 (5.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Notes on the pie charts:
- Quantity Share: which part of the total quantity of df_ecom_filtered falls into descriptions_multiple_stock_codes_fixed_entries.
- Revenue Share: which part of the total revenue of df_ecom_filtered is generated in descriptions_multiple_stock_codes_fixed_entries.
- Entries Share: which part of all entries of df_ecom_filtered occurs in descriptions_multiple_stock_codes_fixed_entries. Every entry is counted separately, even if they are associated with the same order.
- Invoices Coverage: if even one entry of an invoice falls into descriptions_multiple_stock_codes_fixed_entries, it still counts as one full unique order in this chart.
- Descriptions Coverage: if even one entry of a description falls into descriptions_multiple_stock_codes_fixed_entries, it still counts as one full unique description in this chart.
======================================================================================================================================================
Let’s inspect the stock codes of the remaining descriptions with multiple stock codes. As we have already noticed that they contain either uppercase or lowercase letters, we will address such entries accordingly if we find any meaningful insights.
We will use the `np.select()` function to create the `stock_code_letters` column, identifying whether (and in which case) letters are present in such stock codes.
descriptions_multiple_stock_codes_fixed_summary = descriptions_multiple_stock_codes_fixed_summary.reset_index()

conditions = [descriptions_multiple_stock_codes_fixed_summary['stock_code'].str.contains('[a-z]'),
              descriptions_multiple_stock_codes_fixed_summary['stock_code'].str.contains('[A-Z]')]
choices = ['has lowercase letter',
           'has uppercase letter']

descriptions_multiple_stock_codes_fixed_summary['stock_code_letters'] = np.select(conditions, choices, default='without letters')
descriptions_multiple_stock_codes_fixed_summary.head(3)
initial_description | description | stock_code | invoice_no_count | unit_price_mean | unit_price_std | stock_code_letters | |
---|---|---|---|---|---|---|---|
0 | 3 GARDENIA MORRIS BOXED CANDLES | 3 GARDENIA MORRIS BOXED CANDLES | 85034A | 83 | 2.79 | 2.18 | has uppercase letter |
1 | 3 GARDENIA MORRIS BOXED CANDLES | 3 GARDENIA MORRIS BOXED CANDLES | 85034a | 3 | 8.29 | 0.00 | has lowercase letter |
2 | 3 WHITE CHOC MORRIS BOXED CANDLES | 3 WHITE CHOC MORRIS BOXED CANDLES | 85034B | 122 | 2.72 | 2.23 | has uppercase letter |
remaining_stock_codes_summary = (descriptions_multiple_stock_codes_fixed_summary.groupby('stock_code_letters')
                                 .agg({'unit_price_mean': 'mean',
                                       'unit_price_std': 'mean',
                                       'stock_code_letters': 'count',
                                       'initial_description': 'nunique',
                                       'description': 'nunique'}))

remaining_stock_codes_summary.columns = ['unit_price_mean', 'unit_price_std', 'stock_codes_number', 'initial_descriptions_number_unique', 'descriptions_number_unique']
remaining_stock_codes_summary.reset_index()
stock_code_letters | unit_price_mean | unit_price_std | stock_codes_number | initial_descriptions_number_unique | descriptions_number_unique | |
---|---|---|---|---|---|---|
0 | has lowercase letter | 6.89 | 0.16 | 114 | 112 | 109 |
1 | has uppercase letter | 3.44 | 1.16 | 133 | 124 | 118 |
2 | without letters | 2.25 | 0.60 | 37 | 26 | 16 |
# checking descriptions without letters
descriptions_multiple_stock_codes_fixed_summary.query('stock_code_letters == "without letters"').head(7)

(descriptions_multiple_stock_codes_fixed_summary.query('stock_code_letters == "without letters"')
 .groupby('description')['stock_code'].nunique()
 .sort_values(ascending=False)
 .reset_index()
 .query('stock_code > 1'))
initial_description | description | stock_code | invoice_no_count | unit_price_mean | unit_price_std | stock_code_letters | |
---|---|---|---|---|---|---|---|
20 | BATHROOM METAL SIGN | BATHROOM METAL SIGN | 82580 | 635 | 0.83 | 0.34 | without letters |
21 | BATHROOM METAL SIGN | BATHROOM METAL SIGN | 21171 | 73 | 1.77 | 0.73 | without letters |
40 | CANNISTER VINTAGE LEAF DESIGN | ROUND STORAGE TIN VINTAGE LEAF | 23244 | 2 | 1.95 | 0.00 | without letters |
69 | COLOURING PENCILS BROWN TUBE | COLOURING PENCILS BROWN TUBE | 10133 | 196 | 0.65 | 0.25 | without letters |
70 | COLOURING PENCILS BROWN TUBE | COLOURING PENCILS BROWN TUBE | 10135 | 178 | 1.41 | 0.64 | without letters |
71 | COLUMBIAN CUBE CANDLE | COLUMBIAN CUBE CANDLE | 72134 | 11 | 0.99 | 0.45 | without letters |
72 | COLUMBIAN CANDLE RECTANGLE | COLUMBIAN CANDLE RECTANGLE | 72131 | 18 | 1.90 | 0.12 | without letters |
description | stock_code | |
---|---|---|
0 | COLUMBIAN CANDLE ROUND | 3 |
1 | BATHROOM METAL SIGN | 2 |
2 | COLOURING PENCILS BROWN TUBE | 2 |
3 | COLUMBIAN CANDLE RECTANGLE | 2 |
4 | COLUMBIAN CUBE CANDLE | 2 |
5 | FRENCH FLORAL CUSHION COVER | 2 |
6 | FRENCH LATTICE CUSHION COVER | 2 |
7 | FRENCH PAISLEY CUSHION COVER | 2 |
8 | FROSTED WHITE BASE | 2 |
9 | HEART T-LIGHT HOLDER | 2 |
10 | PAPER LANTERN 9 POINT SNOW STAR | 2 |
11 | PINK FLOCK GLASS CANDLEHOLDER | 2 |
12 | ROSE DU SUD CUSHION COVER | 2 |
13 | ROUND STORAGE TIN VINTAGE LEAF | 2 |
14 | SQUARE CHERRY BLOSSOM CABINET | 2 |
15 | WHITE BAMBOO RIBS LAMPSHADE | 2 |
df_ecom_filtered.query('stock_code == "72133"')['description'].unique()
array(['COLUMBIAN CANDLE RECTANGLE'], dtype=object)
# checking several close stock codes among the remaining stock codes without letters
print('='*65)
print(f'\033[1mChecking descriptions of close stock codes:\033[0m')
print('-'*65)
for st_code in ['72131', '72132', '72133', '72134']:
    descr = list(df_ecom_filtered.query('stock_code == @st_code')['description'].unique())
    print(f'Stock code "{st_code}" descriptions: {descr}')
print('='*65)
=================================================================
Checking descriptions of close stock codes:
-----------------------------------------------------------------
Stock code "72131" descriptions: ['COLUMBIAN CANDLE RECTANGLE']
Stock code "72132" descriptions: ['COLUMBIAN CUBE CANDLE']
Stock code "72133" descriptions: ['COLUMBIAN CANDLE RECTANGLE']
Stock code "72134" descriptions: ['COLUMBIAN CUBE CANDLE']
=================================================================
Observations and Decisions
Addressing inconsistencies in stock codes and descriptions has greatly improved the accuracy of our analysis, leading to more reliable conclusions and recommendations.
💡 The major insight: a stock code or a description alone is not always sufficient to identify a product, and consolidating stock codes that share the same description seems wrong. It appears reasonable to use the combination of a stock code and a description as a comprehensive product identifier for further analyses. As extra backing for this decision: stock codes with the same description (and vice versa) show different mean prices and even different price variation. We do not know whether such cases in fact represent the same or different products, and the naming conventions are beyond our reach.
⚠ Note: From now on, we will use the term “product” to refer to a combination of a stock code and a description.
⚠ Note: We don’t need to review our prior analysis after addressing naming inconsistencies, since those issues haven’t affected it. However, they could impact further study, so we have resolved them just in time.
# creating a `stock_code_description` column, representing composite keys of stock code + description
df_ecom_filtered['stock_code_description'] = df_ecom_filtered['stock_code'] + "__" + df_ecom_filtered['description']

print('\033[1m`stock_code_description` column examples:\033[0m')
df_ecom_filtered['stock_code_description'].sample(2)
`stock_code_description` column examples:
407784 35911A__MULTICOLOUR RABBIT EGG WARMER
288654 21922__UNION STRIPE WITH FRINGE HAMMOCK
Name: stock_code_description, dtype: object
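As a quick illustration (a sketch using the columns created above), we can compare how many distinct products the composite key distinguishes versus either identifier alone:

# comparing unique counts of the composite key against stock codes and descriptions alone
print(f"Unique stock codes:  {df_ecom_filtered['stock_code'].nunique()}")
print(f"Unique descriptions: {df_ecom_filtered['description'].nunique()}")
print(f"Unique products (stock code + description): {df_ecom_filtered['stock_code_description'].nunique()}")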
Let’s check the entries with negative quantities that remain unclassified: their descriptions and their share of the total.
negative_qty_entries_remaining = df_ecom_filtered.query('quantity < 0')

negative_qty_entries_remaining.sample(5, random_state=10)
negative_qty_entries_remaining['description'].value_counts()
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
230550 | C557175 | 23084 | RABBIT NIGHT LIGHT | -1 | 2019-06-15 11:13:00 | 2.08 | 16170 | 2019 | 6 | 2019-06 | 24 | 2019-Week-24 | 2019-06-15 | 5 | Saturday | -2.08 | RABBIT NIGHT LIGHT | 23084__RABBIT NIGHT LIGHT |
70483 | C542078 | 22189 | CREAM HEART CARD HOLDER | -1 | 2019-01-23 12:11:00 | 3.95 | 12854 | 2019 | 1 | 2019-01 | 4 | 2019-Week-04 | 2019-01-23 | 2 | Wednesday | -3.95 | CREAM HEART CARD HOLDER | 22189__CREAM HEART CARD HOLDER |
515696 | C579781 | 22457 | NATURAL SLATE HEART CHALKBOARD | -1 | 2019-11-28 15:20:00 | 2.95 | 17451 | 2019 | 11 | 2019-11 | 48 | 2019-Week-48 | 2019-11-28 | 3 | Thursday | -2.95 | NATURAL SLATE HEART CHALKBOARD | 22457__NATURAL SLATE HEART CHALKBOARD |
218101 | C556011 | 23155 | KNICKERBOCKERGLORY MAGNET ASSORTED | -6 | 2019-06-06 11:45:00 | 0.83 | 14475 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-06 | 3 | Thursday | -4.98 | KNICKERBOCKERGLORY MAGNET ASSORTED | 23155__KNICKERBOCKERGLORY MAGNET ASS... |
132976 | C547711 | 22692 | DOORMAT WELCOME TO OUR HOME | -1 | 2019-03-22 19:31:00 | 7.95 | 13534 | 2019 | 3 | 2019-03 | 12 | 2019-Week-12 | 2019-03-22 | 4 | Friday | -7.95 | DOORMAT WELCOME TO OUR HOME | 22692__DOORMAT WELCOME TO OUR HOME |
description
REGENCY CAKESTAND 3 TIER 134
JAM MAKING SET WITH JARS 73
SET OF 3 CAKE TINS PANTRY DESIGN 59
STRAWBERRY CERAMIC TRINKET BOX 54
POPCORN HOLDER 46
...
FIRST AID TIN 1
DOOR HANGER MUM + DADS ROOM 1
STRAWBERRY HONEYCOMB GARLAND 1
ENGLISH ROSE SCENTED HANGING FLOWER 1
LARGE HANGING IVORY & RED WOOD BIRD 1
Name: count, Length: 1445, dtype: int64
share_evaluation(negative_qty_entries_remaining, df_ecom, show_qty_rev=True)
======================================================================================================================================================
Evaluation of share: negative_qty_entries_remaining
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 5620 (1.1% of all entries)
Quantity: -56247 (1.1% of the total quantity)
Revenue: -80997.5 (0.8% of the total revenue)
======================================================================================================================================================
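For readers skimming this part: share_evaluation is a helper defined earlier in the project. Stripped of its pie charts, box plots and examples, its core calculation amounts to roughly the following sketch (an illustration, not the actual implementation):

# simplified sketch of the core share calculation (the real helper adds formatting,
# pie charts, box plots and example rows on top of this)
def share_evaluation_sketch(subset, base):
    print(f'Number of entries: {len(subset)} ({len(subset) / len(base):.1%} of all entries)')
    print(f"Quantity: {subset['quantity'].sum()} ({subset['quantity'].sum() / base['quantity'].sum():.1%} of the total quantity)")
    print(f"Revenue: {subset['revenue'].sum():.1f} ({subset['revenue'].sum() / base['revenue'].sum():.1%} of the total revenue)")

share_evaluation_sketch(negative_qty_entries_remaining, df_ecom)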
Observations
We see that the remaining entries with negative quantities account for 1.1% of all entries, 1.1% of the total quantity and 0.8% of the total revenue.
Taking into account the data cleaning already performed, the nature of these entries must be the following:
Decisions
Implementation of Decisions
# getting rid of remaining entries with negative quantities
operation = lambda df: df.query('quantity >= 0')
df_ecom_filtered = data_reduction(df_ecom_filtered, operation)
Number of entries cleaned out from the "df_ecom_filtered": 5620 (1.1%)
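Similarly, data_reduction is a small helper defined earlier in the project; conceptually it does something like the sketch below (the df_name argument is illustrative - the real helper reports the DataFrame name itself):

# rough sketch of a data_reduction-style helper: apply a filtering operation and report the reduction
def data_reduction_sketch(df, operation, df_name='df'):
    reduced = operation(df)
    removed = len(df) - len(reduced)
    print(f'Number of entries cleaned out from the "{df_name}": {removed} ({removed / len(df):.1%})')
    return reduced

# illustrative usage (re-applies an already applied filter, so nothing more is removed)
df_ecom_filtered = data_reduction_sketch(df_ecom_filtered, lambda df: df.query('quantity >= 0'), 'df_ecom_filtered')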
In this section, we will analyze high-volume items in three ways:
- wholesale purchases - individual entries in the top 5% by quantity;
- high-volume products - products in the top 5% by total quantity sold;
- high-volume customers - customers in the top 5% by total quantity purchased.
Note: given the substantial average coefficient of variation of quantity among stock codes (236%), wholesale entries are not necessarily the same entries as those of high-volume products, so we study them separately.
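As an aside, the 236% figure quoted above can be reproduced roughly as follows (a sketch, not the original calculation):

# sketch: average coefficient of variation (std/mean) of entry quantities per stock code
cv_per_stock_code = (
    df_ecom_filtered.groupby('stock_code')['quantity']
    .agg(['mean', 'std'])
    .dropna()
    .query('mean > 0')
)
cv_per_stock_code['cv_pct'] = cv_per_stock_code['std'] / cv_per_stock_code['mean'] * 100
print(f"Average coefficient of variation of quantity among stock codes: {cv_per_stock_code['cv_pct'].mean():.0f}%")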
Wholesale Purchases
# checking wholesale purchases - top 5% by quantity volume
wholesale_threshold = np.percentile(df_ecom_filtered['quantity'], 95)
wholesale_purchases = df_ecom_filtered.query('quantity > @wholesale_threshold').sort_values(by='quantity', ascending=False)

print('='*113)
print(f'\033[1mWe consider wholesale purchases as entries with more than {wholesale_threshold :.0f} items\033[0m (top 5% by quantity volume across all entries)')
print('='*113)
=================================================================================================================
We consider wholesale purchases as entries with more than 30 items (top 5% by quantity volume across all entries)
=================================================================================================================
# checking the share of wholesale purchases according to quantity amounts
share_evaluation(wholesale_purchases, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_example=True, example_type='head', example_limit=3)
======================================================================================================================================================
Evaluation of share: wholesale_purchases
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 25606 (4.9% of all entries)
Quantity: 2454459 (45.3% of the total quantity)
Revenue: 3535221.0 (35.3% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Pie chart notes:
- Quantity Share: what part of the total quantity of df_ecom_filtered falls into wholesale_purchases.
- Revenue Share: what part of the total revenue of df_ecom_filtered is generated in wholesale_purchases.
- Entries Share: what part of all entries of df_ecom_filtered occurs in wholesale_purchases. Every entry is counted separately, even if they are associated with the same order.
- Invoices Coverage: if at least one entry of an order falls into wholesale_purchases, it still counts as one full unique order in this chart.
- Products Coverage: if at least one entry of a product falls into wholesale_purchases, it still counts as one full unique product in this chart.
- Customers Coverage: if at least one entry of a customer falls into wholesale_purchases, they still count as one full unique customer in this chart.
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
502122 578841 84826 ASSTD DESIGN 3D PAPER STICKERS 12540 2019-11-23 15:57:00 0.00 13256 2019 11
421632 573008 84077 WORLD WAR 2 GLIDERS ASSTD DESIGNS 4800 2019-10-25 12:26:00 0.21 12901 2019 10
206121 554868 22197 SMALL POPCORN HOLDER 4300 2019-05-25 10:52:00 0.72 13135 2019 5
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
502122 2019-11 47 2019-Week-47 2019-11-23 5 Saturday 0.00
421632 2019-10 43 2019-Week-43 2019-10-25 4 Friday 1008.00
206121 2019-05 21 2019-Week-21 2019-05-25 5 Saturday 3096.00
description stock_code_description
502122 ASSTD DESIGN 3D PAPER STICKERS 84826__ASSTD DESIGN 3D PAPER STICKERS
421632 WORLD WAR 2 GLIDERS ASSTD DESIGNS 84077__WORLD WAR 2 GLIDERS ASSTD DES...
206121 POPCORN HOLDER 22197__POPCORN HOLDER
======================================================================================================================================================
We see that the top-quantity entry has a zero unit price and, consequently, zero revenue. Let’s examine the other zero-unit-price entries among wholesale purchases.
share_evaluation(wholesale_purchases.query('unit_price==0'), df_ecom_filtered, show_qty_rev=True, show_example=True)
======================================================================================================================================================
Evaluation of share: the data slice mentioned in the call function
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 23 (0.0% of all entries)
Quantity: 16172 (0.3% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
117892 546406 46000S POLYESTER FILLER PAD 40x40cm 70 2019-03-09 16:21:00 0.00 0 2019 3
117893 546406 46000M POLYESTER FILLER PAD 45x45cm 60 2019-03-09 16:21:00 0.00 0 2019 3
228691 556939 46000S POLYESTER FILLER PAD 40x40cm 160 2019-06-13 16:34:00 0.00 0 2019 6
314748 564651 21786 POLKADOT RAIN HAT 144 2019-08-24 14:19:00 0.00 14646 2019 8
198383 554037 22619 SET OF 6 SOLDIER SKITTLES 80 2019-05-18 14:13:00 0.00 12415 2019 5
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue description \
117892 2019-03 10 2019-Week-10 2019-03-09 5 Saturday 0.00 POLYESTER FILLER PAD 40x40cm
117893 2019-03 10 2019-Week-10 2019-03-09 5 Saturday 0.00 POLYESTER FILLER PAD 45x45cm
228691 2019-06 24 2019-Week-24 2019-06-13 3 Thursday 0.00 POLYESTER FILLER PAD 40x40cm
314748 2019-08 34 2019-Week-34 2019-08-24 5 Saturday 0.00 POLKADOT RAIN HAT
198383 2019-05 20 2019-Week-20 2019-05-18 5 Saturday 0.00 SET OF 6 SOLDIER SKITTLES
stock_code_description
117892 46000S__POLYESTER FILLER PAD 40x40cm
117893 46000M__POLYESTER FILLER PAD 45x45cm
228691 46000S__POLYESTER FILLER PAD 40x40cm
314748 21786__POLKADOT RAIN HAT
198383 22619__SET OF 6 SOLDIER SKITTLES
======================================================================================================================================================
Observations and Decisions
It seems that zero unit price entries are primarily associated with data corrections, as evidenced by descriptions like “check” and “Adjustment”. Such operations represent a negligible share of entries and less than 1% of the total quantity. They are inessential for further product analyses, so we can remove them to reduce noise.
Later on, within the Unit Price Distribution Analysis, we will study all cases of zero unit prices (not only for wholesale entries) and decide how to address them.
Implementation of Decisions
# cleaning out zero unit price entries from `wholesale_purchases`
operation = lambda df: df.query('unit_price != 0')
wholesale_purchases = data_reduction(wholesale_purchases, operation)
Number of entries cleaned out from the "wholesale_purchases": 23 (0.1%)
Let’s examine the cleaned DataFrame of wholesale purchases.
# checking the share of cleaned DataFrame of wholesale purchases
share_evaluation(wholesale_purchases, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_example=False)
======================================================================================================================================================
Evaluation of share: wholesale_purchases
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 25583 (4.9% of all entries)
Quantity: 2438287 (45.0% of the total quantity)
Revenue: 3535221.0 (35.3% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
# studying quantity distribution in wholesale purchases
distribution_IQR(df=wholesale_purchases, parameter='quantity', x_limits=[0, 500], bins=[100, 400], speed_up_plotting=True, target_sample=5000, outliers_info=False)
Note: A sample data slice 20% of "wholesale_purchases" was used for histogram plotting instead of the full DataFrame.
This significantly reduced plotting time for the large dataset. The accuracy of the visualization might be slightly reduced, meanwhile it should be sufficient for exploratory analysis.
==================================================
Statistics on quantity
in wholesale_purchases
count 25583.00
mean 95.31
std 144.50
min 31.00
25% 44.00
50% 50.00
75% 100.00
max 4800.00
Name: quantity, dtype: float64
--------------------------------------------------
The distribution is extremely skewed to the right
(skewness: 10.3)
Note: outliers affect skewness calculation
==================================================
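The headline numbers reported by distribution_IQR can be reproduced with plain pandas if needed; a minimal sketch over the same column:

# minimal sketch reproducing the summary statistics and skewness reported above
qty = wholesale_purchases['quantity']
print(qty.describe().round(2))
print(f'Skewness: {qty.skew():.1f}')  # strongly inflated by a few extreme wholesale orders
print(f'Median {qty.median():.0f} vs mean {qty.mean():.1f}')  # robust vs non-robust measure of the center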
Let’s examine customers’ repeat wholesale purchases of the same products. For this purpose we will group our wholesale entries by product and calculate a unique_invoices_per_customer_avg metric. Since ~25% of entries have an unknown customer (customer_id of “0”), we will filter them out; otherwise they would distort our calculations (all unknown customers would act as one unique customer).
# aggregating data by product
wholesale_purchases_products_summary_known_customers = (
    wholesale_purchases.query('customer_id != "0"')
    .groupby(['stock_code_description'])
    .agg({'quantity': 'sum',
          'revenue': 'sum',
          'invoice_no': 'nunique',
          'customer_id': 'nunique'})
    .reset_index()
).round(1)

wholesale_purchases_products_summary_known_customers.columns = ['stock_code_description',
                                                                'quantity',
                                                                'revenue',
                                                                'unique_invoices',
                                                                'unique_customers']

wholesale_purchases_products_summary_known_customers['unique_invoices_per_customer_avg'] = round(
    wholesale_purchases_products_summary_known_customers['unique_invoices'] / wholesale_purchases_products_summary_known_customers['unique_customers'],
    2)

# checking the results
print('='*table_width)
print(f'\033[1mDataFrame `wholesale_purchases_products_summary_known_customers`:\033[0m')
wholesale_purchases_products_summary_known_customers
print('-'*table_width)
print(f'\033[1mDescriptive statistics on wholesale purchases with identified customers grouped by product:\033[0m')
wholesale_purchases_products_summary_known_customers[['unique_customers', 'unique_invoices_per_customer_avg']].describe()
print('='*table_width)
======================================================================================================================================================
DataFrame `wholesale_purchases_products_summary_known_customers`:
stock_code_description | quantity | revenue | unique_invoices | unique_customers | unique_invoices_per_customer_avg | |
---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 446 | 379.10 | 6 | 4 | 1.50 |
1 | 10080__GROOVY CACTUS INFLATABLE | 48 | 18.70 | 1 | 1 | 1.00 |
2 | 10125__MINI FUNKY DESIGN TAPES | 590 | 458.50 | 8 | 4 | 2.00 |
3 | 10133__COLOURING PENCILS BROWN TUBE | 949 | 428.70 | 15 | 13 | 1.15 |
4 | 10135__COLOURING PENCILS BROWN TUBE | 926 | 682.70 | 13 | 11 | 1.18 |
... | ... | ... | ... | ... | ... | ... |
2145 | 90209B__GREEN ENAMEL+GLASS HAIR COMB | 84 | 147.00 | 2 | 1 | 2.00 |
2146 | 90209C__PINK ENAMEL+GLASS HAIR COMB | 204 | 357.00 | 3 | 1 | 3.00 |
2147 | 90210C__RED ACRYLIC FACETED BANGLE | 60 | 75.00 | 1 | 1 | 1.00 |
2148 | 90210D__PURPLE ACRYLIC FACETED BANGLE | 60 | 75.00 | 1 | 1 | 1.00 |
2149 | 90214Y__LETTER "Y" BLING KEY RING | 48 | 13.90 | 1 | 1 | 1.00 |
2150 rows × 6 columns
------------------------------------------------------------------------------------------------------------------------------------------------------
Descriptive statistics on wholesale purchases with identified customers grouped by product:
unique_customers | unique_invoices_per_customer_avg | |
---|---|---|
count | 2150.00 | 2150.00 |
mean | 7.26 | 1.41 |
std | 13.70 | 0.62 |
min | 1.00 | 1.00 |
25% | 1.00 | 1.00 |
50% | 3.00 | 1.17 |
75% | 8.00 | 1.60 |
max | 302.00 | 7.00 |
======================================================================================================================================================
Observations
💡 The unique_invoices_per_customer_avg parameter among wholesale purchases grouped by product is low (median ~1.2, mean 1.4, maximum 7), i.e., customers rarely place more than one or two wholesale orders for the same product.
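A quick way to quantify this from the summary DataFrame built above (a sketch):

# sketch: share of products where the average customer placed more than one wholesale order
summary = wholesale_purchases_products_summary_known_customers
repeat_share = (summary['unique_invoices_per_customer_avg'] > 1).mean()
print(f'Products with any repeat wholesale purchasing (avg > 1 order per customer): {repeat_share:.1%}')
print(f"Median unique_invoices_per_customer_avg: {summary['unique_invoices_per_customer_avg'].median():.2f}")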
High-Volume Products
We will define high-volume products as products in the top 5% by total quantity across all products. We will begin this investigation by calculating metrics aggregated by product, some of which will also be used in upcoming analysis steps. We will primarily use medians rather than means, as they better represent typical values given, for instance, the substantial coefficient of variation in quantity among stock codes (236%).
# aggregating data by products
products_summary = (
    df_ecom_filtered.groupby('stock_code_description')
    .agg(quantity = ('quantity', 'sum'),
         revenue = ('revenue', 'sum'),
         quantity_median = ('quantity', 'median'),
         revenue_median = ('revenue', 'median'),
         unit_price_median = ('unit_price', 'median'),
         invoices_count = ('invoice_no', 'count'),
         unique_invoices = ('invoice_no', 'nunique'),
         unique_customers = ('customer_id', 'nunique'))
    .sort_values(by='quantity', ascending=False)
    .reset_index())

# adding customers share column
unique_customers_total = df_ecom_filtered['customer_id'].nunique()
products_summary['customer_range_share'] = products_summary['unique_customers']/unique_customers_total

# checking result
products_summary
stock_code_description | quantity | revenue | quantity_median | revenue_median | unit_price_median | invoices_count | unique_invoices | unique_customers | customer_range_share | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 22197__POPCORN HOLDER | 56898 | 51334.47 | 12.00 | 10.20 | 0.85 | 1418 | 1392 | 408 | 0.09 |
1 | 84077__WORLD WAR 2 GLIDERS ASSTD DES... | 54951 | 13814.01 | 48.00 | 13.92 | 0.29 | 536 | 535 | 308 | 0.07 |
2 | 85099B__JUMBO BAG RED RETROSPOT | 48375 | 94159.81 | 10.00 | 20.80 | 2.08 | 2112 | 2092 | 636 | 0.15 |
3 | 85123A__WHITE HANGING HEART T-LIGHT ... | 37584 | 104284.24 | 6.00 | 17.70 | 2.95 | 2248 | 2193 | 857 | 0.20 |
4 | 21212__PACK OF 72 RETROSPOT CAKE CASES | 36396 | 21246.45 | 24.00 | 13.20 | 0.55 | 1352 | 1320 | 636 | 0.15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3916 | 23609__SET 10 CARDS SNOWY ROBIN 17099 | 1 | 2.91 | 1.00 | 2.91 | 2.91 | 1 | 1 | 1 | 0.00 |
3917 | 84977__WIRE FLOWER T-LIGHT HOLDER | 1 | 1.25 | 1.00 | 1.25 | 1.25 | 1 | 1 | 1 | 0.00 |
3918 | 23602__SET 10 CARDS 3 WISE MEN 17107 | 1 | 2.91 | 1.00 | 2.91 | 2.91 | 1 | 1 | 1 | 0.00 |
3919 | 22016__Dotcomgiftshop Gift Voucher £... | 1 | 83.33 | 1.00 | 83.33 | 83.33 | 1 | 1 | 1 | 0.00 |
3920 | 51014c__FEATHER PEN,COAL BLACK | 1 | 0.83 | 1.00 | 0.83 | 0.83 | 1 | 1 | 1 | 0.00 |
3921 rows × 10 columns
# calculating threshold for the top quantity per product
products_quantity_top_threshold = round(np.percentile(products_summary['quantity'], 95), 2)
products_quantity_top_threshold
6013.0
# defining the high-volume products
high_volume_products_summary = products_summary.query('quantity > @products_quantity_top_threshold')

# evaluating median quantity
high_volume_products_quantity_median = high_volume_products_summary['quantity_median'].median()
general_quantity_median = products_summary['quantity_median'].median()

print('='*143)
print(f'\033[1mWe consider high-volume products as those with total quantity volume more than '
      f'{products_quantity_top_threshold:0.0f}\033[0m (within the top 5% of total quantity range of all products)\n'
      f'\033[1mThe median of median quantities per purchase for high-volume products is {high_volume_products_quantity_median:0.1f}, which is '
      f'{high_volume_products_quantity_median / general_quantity_median:0.1f} times higher than that of a typical product ({general_quantity_median:0.1f})\033[0m')
print('='*143)
===============================================================================================================================================
We consider high-volume products as those with total quantity volume more than 6013 (within the top 5% of total quantity range of all products)
The median of median quantities per purchase for high-volume products is 8.0, which is 4.0 times higher than that of a typical product (2.0)
===============================================================================================================================================
# checking the share of entries associated with the high-volume products
high_volume_products_list = high_volume_products_summary['stock_code_description'].tolist()
high_volume_products_entries = df_ecom_filtered.query('stock_code_description in @high_volume_products_list')

share_evaluation(high_volume_products_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_outliers=True,
                 show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: high_volume_products_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 134358 (25.7% of all entries)
Quantity: 2272733 (41.9% of the total quantity)
Revenue: 3507257.6 (35.1% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
225189 556636 23308 SET OF 60 VINTAGE LEAF CAKE CASES 1 2019-06-11 15:30:00 1.25 0 2019 6
58133 541221 22356 CHARLOTTE BAG PINK POLKADOT 29 2019-01-12 14:28:00 2.46 0 2019 1
424808 573286 22791 T-LIGHT GLASS FLUTED ANTIQUE 6 2019-10-26 14:38:00 1.25 0 2019 10
277816 561195 23308 SET OF 60 VINTAGE LEAF CAKE CASES 24 2019-07-23 13:57:00 0.55 14796 2019 7
253723 559169 23230 WRAP ALPHABET DESIGN 50 2019-07-04 17:25:00 0.42 16722 2019 7
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
225189 2019-06 24 2019-Week-24 2019-06-11 1 Tuesday 1.25
58133 2019-01 2 2019-Week-02 2019-01-12 5 Saturday 71.34
424808 2019-10 43 2019-Week-43 2019-10-26 5 Saturday 7.50
277816 2019-07 30 2019-Week-30 2019-07-23 1 Tuesday 13.20
253723 2019-07 27 2019-Week-27 2019-07-04 3 Thursday 21.00
description stock_code_description
225189 SET OF 60 VINTAGE LEAF CAKE CASES 23308__SET OF 60 VINTAGE LEAF CAKE C...
58133 CHARLOTTE BAG PINK POLKADOT 22356__CHARLOTTE BAG PINK POLKADOT
424808 T-LIGHT GLASS FLUTED ANTIQUE 22791__T-LIGHT GLASS FLUTED ANTIQUE
277816 SET OF 60 VINTAGE LEAF CAKE CASES 23308__SET OF 60 VINTAGE LEAF CAKE C...
253723 WRAP ALPHABET DESIGN 23230__WRAP ALPHABET DESIGN
======================================================================================================================================================
Top High-Volume Products
Let’s analyze the top high-volume products. We will examine their product categories to understand what types of items they represent, and we will also study their revenue and number of orders (unique invoices) to understand their overall business impact.
# defining top 10 high-volume products
top_10_high_volume_products_summary = high_volume_products_summary.sort_values(by='quantity', ascending=False).head(10)
top_10_high_volume_products_summary.head()
stock_code_description | quantity | revenue | quantity_median | revenue_median | unit_price_median | invoices_count | unique_invoices | unique_customers | customer_range_share | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 22197__POPCORN HOLDER | 56898 | 51334.47 | 12.00 | 10.20 | 0.85 | 1418 | 1392 | 408 | 0.09 |
1 | 84077__WORLD WAR 2 GLIDERS ASSTD DES... | 54951 | 13814.01 | 48.00 | 13.92 | 0.29 | 536 | 535 | 308 | 0.07 |
2 | 85099B__JUMBO BAG RED RETROSPOT | 48375 | 94159.81 | 10.00 | 20.80 | 2.08 | 2112 | 2092 | 636 | 0.15 |
3 | 85123A__WHITE HANGING HEART T-LIGHT ... | 37584 | 104284.24 | 6.00 | 17.70 | 2.95 | 2248 | 2193 | 857 | 0.20 |
4 | 21212__PACK OF 72 RETROSPOT CAKE CASES | 36396 | 21246.45 | 24.00 | 13.20 | 0.55 | 1352 | 1320 | 636 | 0.15 |
# checking the share of top 10 high-volume products
top_10_high_volume_products_list = top_10_high_volume_products_summary['stock_code_description'].tolist()
top_10_high_volume_products_entries = high_volume_products_entries.query('stock_code_description in @top_10_high_volume_products_list')

share_evaluation(top_10_high_volume_products_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True, boxplots_parameter='stock_code_description', show_outliers=False,
                 show_example=False, example_type='sample', example_limit=3)
======================================================================================================================================================
Evaluation of share: top_10_high_volume_products_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 11943 (2.3% of all entries)
Quantity: 379081 (7.0% of the total quantity)
Revenue: 447776.8 (4.5% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
Let’s visualize the main metrics of the top 10 high-volume products: quantity, revenue and number of purchases.
# creating figure having secondary y-axis
fig = make_subplots(specs=[[{'secondary_y': True}]])

# adding bar charts for quantity and revenue
fig.add_trace(
    go.Bar(x=top_10_high_volume_products_summary['stock_code_description'], y=top_10_high_volume_products_summary['quantity'], name='Quantity', marker_color='teal', opacity=0.7),
    secondary_y=False)

fig.add_trace(
    go.Bar(x=top_10_high_volume_products_summary['stock_code_description'], y=top_10_high_volume_products_summary['revenue'], name='Revenue', marker_color='darkred', opacity=0.7),
    secondary_y=False)

# adding line plots with markers for number of entries
fig.add_trace(
    go.Scatter(x=top_10_high_volume_products_summary['stock_code_description'], y=top_10_high_volume_products_summary['invoices_count'], name='Entries', line={'color': 'purple', 'width': 3}, mode='lines+markers', marker={'size': 8}),
    secondary_y=True)

# updating layout and axes
fig.update_layout(
    title={'text': 'Top 10 High-Volume Products: Quantity, Revenue and Purchases (Entries)', 'font_size': 20, 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    barmode='group',
    xaxis_title='Description',
    xaxis_tickangle=45,
    legend={'orientation': 'h', 'yanchor': 'bottom', 'y': 1.02, 'xanchor': 'right', 'x': 1},
    height=750,
    width=1200)

fig.update_yaxes(title_text='Quantity & Revenue', secondary_y=False)

fig.update_yaxes(
    title_text='Entries',
    tickfont={'color': 'purple'},
    titlefont={'color': 'purple'},
    secondary_y=True)

fig.show()
Additionally, let’s display the quantity totals and distributions of the top-selling products; here we will consider twice as many products (20) for a broader overview.
# examination of quantity totals and distributions of top-selling products
plot_totals_distribution(df_ecom_filtered, 'stock_code_description', 'quantity', show_outliers=True, n_items=20)
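plot_totals_distribution is another helper defined earlier in the project; conceptually it pairs a totals bar chart with per-item box plots. A rough, simplified sketch of the idea (not the actual implementation) could look like this:

import plotly.express as px

# rough sketch: totals bar chart plus per-item value distributions for the top items
def plot_totals_distribution_sketch(df, item_col, value_col, n_items=20):
    top_items = (df.groupby(item_col)[value_col].sum()
                   .sort_values(ascending=False).head(n_items).index)
    subset = df[df[item_col].isin(top_items)]
    px.bar(subset.groupby(item_col, as_index=False)[value_col].sum(),
           x=item_col, y=value_col, title=f'Total {value_col} per {item_col}').show()
    px.box(subset, x=item_col, y=value_col, title=f'{value_col} distribution per {item_col}').show()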
Observations
Overall high-volume products
The top 10 high-volume products
💡 The top 10 high-volume products (representing just 0.26% of the total product range) alone generate 2.3% of all purchases, and contribute 7% of the total quantity and 4.5% of the total revenue.
💡 Interestingly, these top 10 high-volume products are extremely popular: together they reach ~57% of all customers (i.e., ~57% of customers purchased at least one of these products; see the sketch after these observations). Four products even reached 15-20% of customers each.
We can see significant variation in the number of purchases per product. The highest purchase frequency is seen for “JUMBO BAG RED RETROSPOT” (~2,100 purchases) and “WHITE HANGING HEART T-LIGHT HOLDER” (~2,250 purchases), while most products generated between 400 and 1,500 purchases.
The box plots reveal significant variability in purchase quantities across products.
💡 The top-selling products represent various categories, such as storage solutions (bags, cases, holders) and home decor goods (paint sets, night lights, tissues).
In the next steps, we will try to categorize a broader range of products, though the variety and complexity of descriptions might make it challenging, or even impossible.
💡 Overall, the visualizations of the key metrics show that products succeed in different ways: some through high sales volume, others through high revenue (high prices with sufficient, though not always the highest, quantities sold), and some through frequent purchases.
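The customer-reach figures mentioned above can be checked roughly as follows (a sketch mirroring the Customers Coverage logic of share_evaluation):

# sketch: overall and per-product customer reach of the top 10 high-volume products
total_customers = df_ecom_filtered['customer_id'].nunique()
reached_customers = top_10_high_volume_products_entries['customer_id'].nunique()
print(f'Top 10 high-volume products reach {reached_customers / total_customers:.0%} of all customers')

# per-product reach, reusing the customer_range_share column of products_summary
print(top_10_high_volume_products_summary[['stock_code_description', 'customer_range_share']])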
High-Volume Customers
Let’s examine the customers with the highest purchase volumes. We define high-volume customers as those whose purchase volume falls within the top 5% of all customers. For this study, we will first create a DataFrame summarizing the main parameters by customer, excluding entries with missing customer ids (zero value) from the current analysis. Then we will define the top performers.
# aggregating data by customers
customers_summary = (
    df_ecom_filtered.query('customer_id != "0"')  # excluding entries with missing customer ids
    .groupby('customer_id')
    .agg(quantity = ('quantity', 'sum'),
         revenue = ('revenue', 'sum'),
         unit_price_mean = ('unit_price', 'mean'),
         unit_price_median = ('unit_price', 'median'),
         invoices_count = ('invoice_no', 'count'),
         unique_invoices = ('invoice_no', 'nunique'),
         unique_products = ('stock_code_description', 'nunique'))
    .reset_index()
    .sort_values(by='quantity', ascending=False))

# adding extra columns
unique_products_total = df_ecom_filtered['stock_code_description'].nunique()
customers_summary['product_range_share'] = (customers_summary['unique_products']/unique_products_total)
customers_summary['entries_per_invoice_avg'] = customers_summary['invoices_count']/customers_summary['unique_invoices']

customers_summary.head(10)
customer_id | quantity | revenue | unit_price_mean | unit_price_median | invoices_count | unique_invoices | unique_products | product_range_share | entries_per_invoice_avg | |
---|---|---|---|---|---|---|---|---|---|---|
1689 | 14646 | 197420 | 279138.02 | 2.39 | 1.45 | 2064 | 73 | 703 | 0.18 | 28.27 |
1879 | 14911 | 80404 | 136161.83 | 3.33 | 2.08 | 5586 | 198 | 1785 | 0.46 | 28.21 |
54 | 12415 | 77669 | 124564.53 | 2.44 | 1.65 | 715 | 20 | 443 | 0.11 | 35.75 |
3725 | 17450 | 69973 | 194390.79 | 3.38 | 2.55 | 336 | 46 | 124 | 0.03 | 7.30 |
3768 | 17511 | 64549 | 91062.38 | 2.31 | 1.65 | 963 | 31 | 454 | 0.12 | 31.06 |
4197 | 18102 | 64124 | 259657.30 | 4.50 | 4.27 | 431 | 60 | 150 | 0.04 | 7.18 |
996 | 13694 | 63312 | 65039.62 | 1.57 | 1.25 | 568 | 50 | 366 | 0.09 | 11.36 |
1434 | 14298 | 58343 | 51527.30 | 1.50 | 1.04 | 1637 | 44 | 884 | 0.23 | 37.20 |
1333 | 14156 | 57755 | 116560.08 | 3.40 | 2.10 | 1382 | 54 | 713 | 0.18 | 25.59 |
3174 | 16684 | 50255 | 66653.56 | 2.45 | 1.65 | 277 | 28 | 119 | 0.03 | 9.89 |
# calculating the top quantity threshold
high_volume_customers_qty_threshold = round(np.percentile(customers_summary['quantity'], 95), 0)
high_volume_customers_qty_threshold
3536.0
# defining high-volume customers - as the top 5% by quantity volume
high_volume_customers_summary = customers_summary.query('quantity > @high_volume_customers_qty_threshold').sort_values(by='quantity', ascending=False)
high_volume_customers_list = high_volume_customers_summary['customer_id'].tolist()

high_volume_customers_entries = df_ecom_filtered.query('customer_id in @high_volume_customers_list')

print('='*131)
print(f'\033[1mWe consider high-volume customers as those who purchased more than {high_volume_customers_qty_threshold:.0f} items in total (the top 5% of customers by quantity volume)\033[0m')
print('-'*131)
print()
print(f'\033[1mDescriptive statistics on purchases made by high-volume customers:\033[0m')
high_volume_customers_entries[['quantity', 'revenue']].describe()
print('='*131)
===================================================================================================================================
We consider high-volume customers as those who purchased more than 3536 items in total (the top 5% of customers by quantity volume)
-----------------------------------------------------------------------------------------------------------------------------------
Descriptive statistics on purchases made by high-volume customers:
quantity | revenue | |
---|---|---|
count | 102199.00 | 102199.00 |
mean | 23.81 | 39.08 |
std | 86.70 | 129.01 |
min | 1.00 | 0.00 |
25% | 2.00 | 5.04 |
50% | 8.00 | 15.00 |
75% | 20.00 | 29.70 |
max | 12540.00 | 7144.72 |
===================================================================================================================================
# checking the share of purchases made by high-volume customers
share_evaluation(high_volume_customers_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_example=True, example_type='head', example_limit=5)
======================================================================================================================================================
Evaluation of share: high_volume_customers_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 102199 (19.5% of all entries)
Quantity: 2433486 (44.9% of the total quantity)
Revenue: 3994168.4 (39.9% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
26 536370 22728 ALARM CLOCK BAKELIKE PINK 24 2018-11-29 08:45:00 3.75 12583 2018 11
27 536370 22727 ALARM CLOCK BAKELIKE RED 24 2018-11-29 08:45:00 3.75 12583 2018 11
28 536370 22726 ALARM CLOCK BAKELIKE GREEN 12 2018-11-29 08:45:00 3.75 12583 2018 11
29 536370 21724 PANDA AND BUNNIES STICKER SHEET 12 2018-11-29 08:45:00 0.85 12583 2018 11
30 536370 21883 STARS GIFT TAPE 24 2018-11-29 08:45:00 0.65 12583 2018 11
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue description \
26 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 90.00 ALARM CLOCK BAKELIKE PINK
27 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 90.00 ALARM CLOCK BAKELIKE RED
28 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 45.00 ALARM CLOCK BAKELIKE GREEN
29 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 10.20 PANDA AND BUNNIES STICKER SHEET
30 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 15.60 STARS GIFT TAPE
stock_code_description
26 22728__ALARM CLOCK BAKELIKE PINK
27 22727__ALARM CLOCK BAKELIKE RED
28 22726__ALARM CLOCK BAKELIKE GREEN
29 21724__PANDA AND BUNNIES STICKER SHEET
30 21883__STARS GIFT TAPE
======================================================================================================================================================
Let’s also check the volume and share of purchases where the customer is not identified.
entries_without_customer = df_ecom_filtered.query('customer_id == "0"')

share_evaluation(entries_without_customer, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_example=True, example_type='head', example_limit=5)
======================================================================================================================================================
Evaluation of share: entries_without_customer
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 131796 (25.2% of all entries)
Quantity: 422806 (7.8% of the total quantity)
Revenue: 1510677.5 (15.1% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
1443 536544 21773 DECORATIVE ROSE BATHROOM BOTTLE 1 2018-11-29 14:32:00 2.51 0 2018 11
1444 536544 21774 DECORATIVE CATS BATHROOM BOTTLE 2 2018-11-29 14:32:00 2.51 0 2018 11
1445 536544 21786 POLKADOT RAIN HAT 4 2018-11-29 14:32:00 0.85 0 2018 11
1446 536544 21787 RAIN PONCHO RETROSPOT 2 2018-11-29 14:32:00 1.66 0 2018 11
1447 536544 21790 VINTAGE SNAP CARDS 9 2018-11-29 14:32:00 1.66 0 2018 11
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue description \
1443 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 2.51 DECORATIVE ROSE BATHROOM BOTTLE
1444 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 5.02 DECORATIVE CATS BATHROOM BOTTLE
1445 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 3.40 POLKADOT RAIN HAT
1446 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 3.32 RAIN PONCHO RETROSPOT
1447 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 14.94 VINTAGE SNAP CARDS
stock_code_description
1443 21773__DECORATIVE ROSE BATHROOM BOTTLE
1444 21774__DECORATIVE CATS BATHROOM BOTTLE
1445 21786__POLKADOT RAIN HAT
1446 21787__RAIN PONCHO RETROSPOT
1447 21790__VINTAGE SNAP CARDS
======================================================================================================================================================
Top High-Volume Customers
# defining top 10 high-volume customers
top_10_high_volume_customers_summary = high_volume_customers_summary.sort_values(by='quantity', ascending=False).head(10)
top_10_high_volume_customers_summary.head()
customer_id | quantity | revenue | unit_price_mean | unit_price_median | invoices_count | unique_invoices | unique_products | product_range_share | entries_per_invoice_avg | |
---|---|---|---|---|---|---|---|---|---|---|
1689 | 14646 | 197420 | 279138.02 | 2.39 | 1.45 | 2064 | 73 | 703 | 0.18 | 28.27 |
1879 | 14911 | 80404 | 136161.83 | 3.33 | 2.08 | 5586 | 198 | 1785 | 0.46 | 28.21 |
54 | 12415 | 77669 | 124564.53 | 2.44 | 1.65 | 715 | 20 | 443 | 0.11 | 35.75 |
3725 | 17450 | 69973 | 194390.79 | 3.38 | 2.55 | 336 | 46 | 124 | 0.03 | 7.30 |
3768 | 17511 | 64549 | 91062.38 | 2.31 | 1.65 | 963 | 31 | 454 | 0.12 | 31.06 |
# checking the share of top 10 high-volume customers
top_10_high_volume_customers_list = top_10_high_volume_customers_summary['customer_id'].tolist()
top_10_high_volume_customers_entries = high_volume_customers_entries.query('customer_id in @top_10_high_volume_customers_list')

share_evaluation(top_10_high_volume_customers_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True, boxplots_parameter='customer_id', show_outliers=False,
                 show_example=False, example_type='sample', example_limit=3)
======================================================================================================================================================
Evaluation of share: top_10_high_volume_customers_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 13959 (2.7% of all entries)
Quantity: 783804 (14.5% of the total quantity)
Revenue: 1384755.4 (13.8% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
Let’s visualize the main metrics of the top 10 high-volume customers: quantity, revenue and number of purchases.
# getting top 10 customers summary
top_10_customers_summary = high_volume_customers_summary.copy().head(10)

# creating figure having secondary y-axis
fig = make_subplots(specs=[[{'secondary_y': True}]])

# adding bar charts for quantity and revenue
fig.add_trace(
    go.Bar(x=top_10_customers_summary['customer_id'], y=top_10_customers_summary['quantity'], name='Quantity', marker_color='teal', opacity=0.7),
    secondary_y=False)

fig.add_trace(
    go.Bar(x=top_10_customers_summary['customer_id'], y=top_10_customers_summary['revenue'], name='Revenue', marker_color='darkred', opacity=0.7),
    secondary_y=False)

# adding line plots with markers for number of entries
fig.add_trace(
    go.Scatter(x=top_10_customers_summary['customer_id'], y=top_10_customers_summary['invoices_count'], name='Entries', line={'color': 'purple', 'width': 3}, mode='lines+markers', marker={'size': 8}),
    secondary_y=True)

# updating layout and axes
fig.update_layout(
    title={'text': 'Top 10 High-Volume Customers: Quantity, Revenue and Purchases (Entries)', 'font_size': 20, 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    barmode='group',
    xaxis_title='Customers',
    xaxis=dict(tickangle=45, type='category'),
    legend={'orientation': 'h', 'yanchor': 'bottom', 'y': 1.02, 'xanchor': 'right', 'x': 1},
    height=600,
    width=1200)

fig.update_yaxes(title_text='Quantity & Revenue', secondary_y=False)

fig.update_yaxes(
    title_text='Entries',
    tickfont={'color': 'purple'},
    titlefont={'color': 'purple'},
    secondary_y=True)

fig.show()
Additionally, let’s display the quantity totals and distributions of the top high-volume customers; here we will consider a wider range of 40 top customers for a broader overview.
plot_totals_distribution(high_volume_customers_entries, 'customer_id', 'quantity', n_items=40, show_outliers=True, fig_height=900)
We see an outstanding customer with id “14646”; let’s take a closer look at their metrics.
# checking the share and examples of purchases made by the top high-volume customer
the_top_high_volume_customer_entries = high_volume_customers_entries.query('customer_id == "14646"')

share_evaluation(the_top_high_volume_customer_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_outliers=False,
                 show_period=True,
                 show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: the_top_high_volume_customer_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 2064 (0.4% of all entries)
Quantity: 197420 (3.6% of the total quantity)
Revenue: 279138.0 (2.8% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Invoice period coverage: 2018-12-18 - 2019-12-06 (94.6%; 353 out of 373 total days; 12 out of 12 total months)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year \
57415 541206 22029 SPACEBOY BIRTHDAY CARD 144 2019-01-12 12:24:00 0.36 14646 2019
434743 574059 22728 ALARM CLOCK BAKELIKE PINK 1 2019-10-31 14:13:00 3.75 14646 2019
314725 564650 22326 ROUND SNACK BOXES SET OF4 WOODLAND 48 2019-08-24 14:17:00 2.55 14646 2019
299002 563076 23256 CHILDRENS CUTLERY SPACEBOY 72 2019-08-09 16:12:00 3.75 14646 2019
186849 552883 22150 3 STRIPEY MICE FELTCRAFT 40 2019-05-10 10:13:00 1.65 14646 2019
invoice_month invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
57415 1 2019-01 2 2019-Week-02 2019-01-12 5 Saturday 51.84
434743 10 2019-10 44 2019-Week-44 2019-10-31 3 Thursday 3.75
314725 8 2019-08 34 2019-Week-34 2019-08-24 5 Saturday 122.40
299002 8 2019-08 32 2019-Week-32 2019-08-09 4 Friday 270.00
186849 5 2019-05 19 2019-Week-19 2019-05-10 4 Friday 66.00
description stock_code_description
57415 SPACEBOY BIRTHDAY CARD 22029__SPACEBOY BIRTHDAY CARD
434743 ALARM CLOCK BAKELIKE PINK 22728__ALARM CLOCK BAKELIKE PINK
314725 ROUND SNACK BOXES SET OF4 WOODLAND 22326__ROUND SNACK BOXES SET OF4 WOO...
299002 CHILDRENS CUTLERY SPACEBOY 23256__CHILDRENS CUTLERY SPACEBOY
186849 3 STRIPEY MICE FELTCRAFT 22150__3 STRIPEY MICE FELTCRAFT
======================================================================================================================================================
Business Customers
We observed that several customers have extremely high product coverage (the product_range_share column of the customers_summary DataFrame), reaching almost half of the product range. They appear to be business-related customers, probably resellers. Profit from such a group may benefit from a dedicated approach, so let’s learn more about them, first of all in terms of their share and overall impact.
Let’s define business customers as those whose purchases cover at least 10% of the product range. Given our definitions, the high-volume and business customer groups are likely to overlap, without being identical.
# defining business customers
business_customers_summary = customers_summary.query('product_range_share >= 0.1').sort_values(by='product_range_share', ascending=False)

business_customers_count = len(business_customers_summary)
business_customers_share = business_customers_count / len(customers_summary)

top_10_business_customers_summary = business_customers_summary.head(10)

print('=' * table_width)
print(f'\033[1mWe define business customers as those whose purchases cover at least 10% of the product range.\033[0m\n'
      f'\033[1mTotal number of identified business customers:\033[0m {business_customers_count} ({business_customers_share*100 :0.1f}% of all customers)\n')
print(f'\033[1mTop 10 business customers summary:\033[0m\n')
print(top_10_business_customers_summary)
print('=' * table_width)
======================================================================================================================================================
We define business customers as those whose purchases cover at least 10% of the product range.
Total number of identified business customers: 32 (0.7% of all customers)
Top 10 business customers summary:
customer_id quantity revenue unit_price_mean unit_price_median invoices_count unique_invoices unique_products product_range_share \
1879 14911 80404 136161.83 3.33 2.08 5586 198 1785 0.46
325 12748 25051 31650.78 2.38 1.65 4397 206 1767 0.45
4007 17841 22814 40466.09 2.54 1.65 7666 124 1325 0.34
1289 14096 16336 53258.43 4.21 2.92 5095 17 1118 0.29
1434 14298 58343 51527.30 1.50 1.04 1637 44 884 0.23
1661 14606 6177 11926.15 2.80 1.65 2674 90 816 0.21
1779 14769 7238 10415.33 2.71 1.65 1061 8 717 0.18
1333 14156 57755 116560.08 3.40 2.10 1382 54 713 0.18
1689 14646 197420 279138.02 2.39 1.45 2064 73 703 0.18
561 13089 31025 58762.08 2.74 1.65 1814 97 636 0.16
entries_per_invoice_avg
1879 28.21
325 21.34
4007 61.82
1289 299.71
1434 37.20
1661 29.71
1779 132.62
1333 25.59
1689 28.27
561 18.70
======================================================================================================================================================
# checking the share of entries associated with business customers
business_customers_list = business_customers_summary['customer_id'].tolist()
business_customers_entries = df_ecom_filtered.query('customer_id in @business_customers_list')

share_evaluation(business_customers_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True, show_outliers=False,
                 show_period=True,
                 show_example=False)
======================================================================================================================================================
Evaluation of share: business_customers_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 54946 (10.5% of all entries)
Quantity: 765445 (14.1% of the total quantity)
Revenue: 1195534.7 (12.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Invoice period coverage: 2018-11-29 - 2019-12-07 (100.0%; 373 out of 373 total days; 12 out of 12 total months)
======================================================================================================================================================
print('='*43)
display(Markdown('**High-volume customers vs. business customers**'))

print(f'\033[1m Share of the total quantity\033[0m')
print(f'\033[1m - High-volume customers:\033[0m {high_volume_customers_entries["quantity"].sum() / df_ecom_filtered["quantity"].sum():.1%}')
print(f'\033[1m - Business customers:\033[0m {business_customers_entries["quantity"].sum() / df_ecom_filtered["quantity"].sum():.1%}\n')
print(f'\033[1m Share of the total revenue\033[0m')
print(f'\033[1m - High-volume customers:\033[0m {high_volume_customers_entries["revenue"].sum() / df_ecom_filtered["revenue"].sum():.1%}')
print(f'\033[1m - Business customers:\033[0m {business_customers_entries["revenue"].sum() / df_ecom_filtered["revenue"].sum():.1%}')
print('-'*43)
print(f'\033[1m Median coverage of the product range\033[0m')
print(f'\033[1m - High-volume customers:\033[0m {high_volume_customers_summary["product_range_share"].median():.1%}')
print(f'\033[1m - Business customers:\033[0m {business_customers_summary["product_range_share"].median():.1%}\n')
print(f'\033[1mMedian quantity per purchase\033[0m')
print(f'\033[1m - High-volume customers:\033[0m {high_volume_customers_entries["quantity"].median():.0f}')
print(f'\033[1m - Business customers:\033[0m {business_customers_entries["quantity"].median():.0f}\n')
print(f'\033[1mMedian quantity per order\033[0m')
print(f'\033[1m - High-volume customers:\033[0m {high_volume_customers_entries.groupby("invoice_no")["quantity"].sum().median():.0f}')
print(f'\033[1m - Business customers:\033[0m {business_customers_entries.groupby("invoice_no")["quantity"].sum().median():.0f}')
print('='*43)
===========================================
High-volume customers vs. business customers
Share of the total quantity
- High-volume customers: 44.9%
- Business customers: 14.1%
Share of the total revenue
- High-volume customers: 39.9%
- Business customers: 12.0%
-------------------------------------------
Median coverage of the product range
- High-volume customers: 3.9%
- Business customers: 12.2%
Median quantity per purchase
- High-volume customers: 8
- Business customers: 3
Median quantity per order
- High-volume customers: 248
- Business customers: 185
===========================================
Let’s also examine how many of the top contributing high-volume customers and business customers overlap, comparing the top 20 of each group. We will also display the quantity totals and distributions of the top business customers.
# getting the lists of the top 20 high-volume customers and the top 20 business customers by quantity
top_20_high_volume_customers = set(high_volume_customers_summary.sort_values(by='quantity', ascending=False).head(20)['customer_id'])
top_20_business_customers = set(business_customers_summary.sort_values(by='quantity', ascending=False).head(20)['customer_id'])

common_customers_quantity = top_20_high_volume_customers.intersection(top_20_business_customers)
number_of_common_customers = len(common_customers_quantity)
share_of_common_customers = number_of_common_customers / 20
print('='*113)
print(f'\033[1mShare of common customers among the top high-volume customers and the top business customers:\033[0m {share_of_common_customers :0.1%} ({number_of_common_customers} out of 20)')
print('='*113)
=================================================================================================================
Share of common customers among the top high-volume customers and the top business customers: 40.0% (8 out of 20)
=================================================================================================================
plot_totals_distribution(business_customers_entries, 'customer_id', 'quantity', n_items=20, show_outliers=True)
There are 8 out of 20 customers in common between the top high-volume customers and the top business customers, i.e. 40% of them. We also see very evident leaders among the top business customers, and the share of quantity associated with these common customers appears to be much larger than 40%. Let’s check it.
common_top_8_quantity_customers_entries = df_ecom_filtered.query('customer_id in @common_customers_quantity')
share_evaluation(common_top_8_quantity_customers_entries, df_ecom_filtered, show_qty_rev=True)
======================================================================================================================================================
Evaluation of share: common_top_8_quantity_customers_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 16527 (3.2% of all entries)
Quantity: 605312 (11.2% of the total quantity)
Revenue: 918409.0 (9.2% of the total revenue)
======================================================================================================================================================
Observations
The top 5% of customers by purchase volume (high-volume customers according to our definition) account for ~20% of all entries, ~45% of the total quantity, and ~40% of the total revenue.
The mean quantity per purchase (~23) is almost three times the median (8), indicating a strongly skewed distribution driven by a few very large purchases.
High-volume customers buy a wide variety of products rather than just a few items in bulk: these 5% of customers cover 83% of unique products.
The box plots reveal significant variability in purchasing behavior across customers. Most customers have narrow interquartile ranges, indicating consistent purchasing behavior, while the top customer “14646” displays a wide range with high variability and outliers extending beyond 2000 units, reflecting sporadic large purchases. Other customers show occasional outliers but within smaller ranges.
The top high-volume customer’s impact is outstanding: it accounts for ~197k units, while the next highest customers generate around 80k units each, a significant gap.
The highest purchase frequency is seen for customer “14911” (~5600 entries), while most customers in the top 20 maintain between 300 and 1500 entries.
Note: a significant share of purchases is performed by undefined customers: ~25% of all entries, ~8% of the total quantity, and ~15% of the total revenue.
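The footprint of undefined customers mentioned in the note can be checked directly; a quick sketch, assuming (as in the queries used throughout this project) that unknown customers are stored under customer_id "0":
# measuring the footprint of undefined customers (customer_id "0") - illustrative check
unknown_customers_entries = df_ecom_filtered.query('customer_id == "0"')
print(f"Entries:  {len(unknown_customers_entries) / len(df_ecom_filtered):.1%}")
print(f"Quantity: {unknown_customers_entries['quantity'].sum() / df_ecom_filtered['quantity'].sum():.1%}")
print(f"Revenue:  {unknown_customers_entries['revenue'].sum() / df_ecom_filtered['revenue'].sum():.1%}")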
To save time, we will study the data already cleaned at the previous stage.
# checking outliers with IQR approach + descriptive statistics
distribution_IQR(df_ecom_filtered, parameter='unit_price', x_limits=[0, 25], title_extension='', bins=[100, 400], outliers_info=True)
Note: A sample data slice 2% of "df_ecom_filtered" was used for histogram plotting instead of the full DataFrame.
This significantly reduced plotting time for the large dataset. The accuracy of the visualization might be slightly reduced, meanwhile it should be sufficient for exploratory analysis.
==================================================
Statistics on unit_price
in df_ecom_filtered
count 522980.00
mean 3.27
std 4.40
min 0.00
25% 1.25
50% 2.08
75% 4.13
max 649.50
Name: unit_price, dtype: float64
--------------------------------------------------
The distribution is extremely skewed to the right
(skewness: 25.7)
Note: outliers affect skewness calculation
--------------------------------------------------
Min border: -3
Max border: 8
--------------------------------------------------
The outliers are considered to be values above 8
We have 44542 values that we can consider outliers
Which makes 8.5% of the total "unit_price" data
==================================================
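The reported borders are consistent with the standard 1.5 * IQR rule applied to the quartiles shown above (assuming distribution_IQR uses this rule; the printed borders appear to be rounded to whole numbers):
# reproducing the IQR borders from the quartiles above (assumed 1.5*IQR rule)
q1, q3 = 1.25, 4.13            # 25th and 75th percentiles of unit_price
iqr = q3 - q1                  # 2.88
lower_border = q1 - 1.5 * iqr  # -3.07 -> reported as -3
upper_border = q3 + 1.5 * iqr  # 8.45 -> reported as 8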
# let's check descriptive statistics of unit price by product
products_unit_price_ranges = df_ecom_filtered.groupby('stock_code_description')['unit_price']
#products_unit_price_std = products_unit_price_ranges.std().mean()
#products_unit_price_var = products_unit_price_ranges.var().mean()
products_unit_price_cov = products_unit_price_ranges.apply(lambda x: x.std() / x.mean() * 100).mean()
print(f'\033[1mAverage coefficient of variation of product price (across products):\033[0m {products_unit_price_cov:.1f}%')
Average coefficient of variation of product price (across products): 32.9%
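Here the coefficient of variation is the ratio of the price standard deviation to the mean price within each product, averaged across products; a value around 33% means a product’s unit price often moves noticeably around its typical level. A toy illustration with hypothetical prices (not taken from the dataset):
# toy example: per-product coefficient of variation of unit price (hypothetical values)
import pandas as pd

toy = pd.DataFrame({'stock_code_description': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'unit_price': [5.0, 5.0, 5.0, 4.0, 8.0, 12.0]})
cov_per_product = (toy.groupby('stock_code_description')['unit_price']
                      .apply(lambda x: x.std() / x.mean() * 100))
print(cov_per_product)         # A: 0.0%, B: 50.0%
print(cov_per_product.mean())  # average coefficient of variation across products: 25.0%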
# checking outliers with the percentile approach
percentile_outliers(df_ecom_filtered, parameter='unit_price', lower_percentile=3, upper_percentile=97, print_limits=True)
==============================================================================================================
Data on unit_price
outliers based on the “percentile approach”
The outliers are considered to be values below 0.39 and above 12.46
We have 24886 values that we can consider outliers
Which makes 4.8% of the total "unit_price" data
--------------------------------------------------------------------------------------------------------------
Limits: {'df_ecom_filtered_unit_price_lower_limit': 0.39, 'df_ecom_filtered_unit_price_upper_limit': 12.46}
==============================================================================================================
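percentile_outliers is a project helper; the limits above can be reproduced directly with numpy under the same percentile rule (an assumed sketch):
# reproducing the percentile-based limits for unit_price (assumed 3rd/97th percentile rule)
import numpy as np

lower_limit, upper_limit = np.percentile(df_ecom_filtered['unit_price'], [3, 97])
outliers_mask = (df_ecom_filtered['unit_price'] < lower_limit) | (df_ecom_filtered['unit_price'] > upper_limit)
print(f'Limits: {lower_limit:.2f} to {upper_limit:.2f}; outliers: {outliers_mask.mean():.1%} of entries')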
We see two major outliers on the boxplot; let’s study them more closely just in case.
# checking the share of entries with the most obvious outliers in 'unit_price'
unit_price_top_outliers_entries = df_ecom_filtered.query('unit_price > 200')
share_evaluation(unit_price_top_outliers_entries, df_ecom_filtered, show_qty_rev=True, show_period=False, show_example=True, example_type='sample', example_limit=5, random_state=10)
======================================================================================================================================================
Evaluation of share: unit_price_top_outliers_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 10 (0.0% of all entries)
Quantity: 69 (0.0% of the total quantity)
Revenue: 41979.5 (0.4% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
222680 556444 22502 PICNIC BASKET WICKER 60 PIECES 60 2019-06-08 15:28:00 649.50 15098 2019 6
51636 540647 22655 VINTAGE RED KITCHEN CABINET 1 2019-01-08 14:57:00 295.00 17406 2019 1
133994 547814 22656 VINTAGE BLUE KITCHEN CABINET 1 2019-03-23 14:19:00 295.00 13452 2019 3
171178 551393 22656 VINTAGE BLUE KITCHEN CABINET 1 2019-04-26 12:22:00 295.00 14973 2019 4
82768 543253 22655 VINTAGE RED KITCHEN CABINET 1 2019-02-02 15:32:00 295.00 14842 2019 2
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue description \
222680 2019-06 23 2019-Week-23 2019-06-08 5 Saturday 38970.00 PICNIC BASKET WICKER SMALL
51636 2019-01 2 2019-Week-02 2019-01-08 1 Tuesday 295.00 VINTAGE RED KITCHEN CABINET
133994 2019-03 12 2019-Week-12 2019-03-23 5 Saturday 295.00 VINTAGE BLUE KITCHEN CABINET
171178 2019-04 17 2019-Week-17 2019-04-26 4 Friday 295.00 VINTAGE BLUE KITCHEN CABINET
82768 2019-02 5 2019-Week-05 2019-02-02 5 Saturday 295.00 VINTAGE RED KITCHEN CABINET
stock_code_description
222680 22502__PICNIC BASKET WICKER SMALL
51636 22655__VINTAGE RED KITCHEN CABINET
133994 22656__VINTAGE BLUE KITCHEN CABINET
171178 22656__VINTAGE BLUE KITCHEN CABINET
82768 22655__VINTAGE RED KITCHEN CABINET
======================================================================================================================================================
Vintage cabinets and picnic baskets (the product descriptions behind these outliers) appear to be ordinary goods, and it is hard to say whether the prices are reasonable. Just in case, let’s check these entries.
# checking products with suspiciously high unit prices
products_top_price_outliers = unit_price_top_outliers_entries['stock_code'].unique()

df_ecom_filtered.query('stock_code in @products_top_price_outliers').groupby(['stock_code_description', 'initial_description'])['unit_price'].value_counts()
stock_code_description initial_description unit_price
22502__PICNIC BASKET WICKER SMALL PICNIC BASKET WICKER 60 PIECES 649.50 2
PICNIC BASKET WICKER SMALL 5.95 209
10.79 98
8.29 96
4.95 30
8.47 29
0.00 1
2.00 1
8.95 1
22655__VINTAGE RED KITCHEN CABINET VINTAGE RED KITCHEN CABINET 125.00 31
295.00 5
50.00 2
22656__VINTAGE BLUE KITCHEN CABINET VINTAGE BLUE KITCHEN CABINET 125.00 16
295.00 3
50.00 1
Name: count, dtype: int64
# checking top-price entries of the most suspicious stock code in the original `df_ecom` DataFrame
'stock_code == "22502" and unit_price == 649.5') df_ecom.query(
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222680 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2019-06-08 15:28:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 38970.00 |
222682 | 556446 | 22502 | PICNIC BASKET WICKER 60 PIECES | 1 | 2019-06-08 15:33:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 649.50 |
# checking entries of the customer who made the suspicious purchase
df_ecom_filtered.query('customer_id == "15098"')
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222670 | 556442 | 22502 | PICNIC BASKET WICKER SMALL | 60 | 2019-06-08 15:22:00 | 4.95 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 297.00 | PICNIC BASKET WICKER SMALL | 22502__PICNIC BASKET WICKER SMALL |
222680 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2019-06-08 15:28:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 38970.00 | PICNIC BASKET WICKER SMALL | 22502__PICNIC BASKET WICKER SMALL |
222682 | 556446 | 22502 | PICNIC BASKET WICKER 60 PIECES | 1 | 2019-06-08 15:33:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 649.50 | PICNIC BASKET WICKER SMALL | 22502__PICNIC BASKET WICKER SMALL |
# checking entries with suspicious description "PICNIC BASKET WICKER 60 PIECES"
df_ecom_filtered.query('initial_description == "PICNIC BASKET WICKER 60 PIECES"')
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222680 | 556444 | 22502 | PICNIC BASKET WICKER 60 PIECES | 60 | 2019-06-08 15:28:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 38970.00 | PICNIC BASKET WICKER SMALL | 22502__PICNIC BASKET WICKER SMALL |
222682 | 556446 | 22502 | PICNIC BASKET WICKER 60 PIECES | 1 | 2019-06-08 15:33:00 | 649.50 | 15098 | 2019 | 6 | 2019-06 | 23 | 2019-Week-23 | 2019-06-08 | 5 | Saturday | 649.50 | PICNIC BASKET WICKER SMALL | 22502__PICNIC BASKET WICKER SMALL |
'customer_id == "15098"'), df_ecom_filtered, show_qty_rev=True) share_evaluation(df_ecom_filtered.query(
======================================================================================================================================================
Evaluation of share: the data slice mentioned in the call function
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 3 (0.0% of all entries)
Quantity: 121 (0.0% of the total quantity)
Revenue: 39916.5 (0.4% of the total revenue)
======================================================================================================================================================
Observations and Decisions
Implementation of Decisions
# cleaning out the main top-price outlier - entries of customer "15098"
operation = lambda df: df.query('customer_id != "15098"')
df_ecom_filtered = data_reduction(df_ecom_filtered, operation)
Number of entries cleaned out from the "df_ecom_filtered": 3 (0.0%)
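data_reduction is a helper defined earlier in the project; its internals are not shown in this section. A minimal equivalent that applies a filtering operation and reports how many entries were removed might look like this (an assumed sketch with a hypothetical name, not the actual implementation):
# a minimal stand-in for the project's data_reduction helper (illustrative only)
def data_reduction_sketch(df, operation, df_name='df_ecom_filtered'):
    """Apply a filtering operation and report how many entries were removed."""
    reduced = operation(df)
    removed = len(df) - len(reduced)
    print(f'Number of entries cleaned out from the "{df_name}": {removed} ({removed / len(df):.1%})')
    return reduced

# usage (hypothetical): df_ecom_filtered = data_reduction_sketch(df_ecom_filtered, operation)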
Let’s check entries with zero unit prices.
zero_unit_price_entries = df_ecom_filtered.query('unit_price == 0')

# checking share of entries with zero prices
share_evaluation(zero_unit_price_entries, df_ecom_filtered, show_example=True, show_qty_rev=True,
                 show_period=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: zero_unit_price_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 412 (0.1% of all entries)
Quantity: 17051 (0.3% of the total quantity)
Revenue: 0.0 (0.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Invoice period coverage: 2018-12-03 - 2019-12-06 (98.7%; 368 out of 373 total days; 12 out of 12 total months)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
41456 539856 37333 RETRO "TEA FOR ONE" 1 2018-12-20 14:41:00 0.00 0 2018 12
193212 553521 22514 CHILDS GARDEN SPADE BLUE 2 2019-05-15 14:35:00 0.00 0 2019 5
313646 564530 22679 FRENCH BLUE METAL DOOR SIGN 4 3 2019-08-23 14:57:00 0.00 0 2019 8
41467 539856 22679 FRENCH BLUE METAL DOOR SIGN 4 2 2018-12-20 14:41:00 0.00 0 2018 12
104422 545176 84968E SET OF 16 VINTAGE BLACK CUTLERY 1 2019-02-26 14:19:00 0.00 0 2019 2
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
41456 2018-12 51 2018-Week-51 2018-12-20 3 Thursday 0.00
193212 2019-05 20 2019-Week-20 2019-05-15 2 Wednesday 0.00
313646 2019-08 34 2019-Week-34 2019-08-23 4 Friday 0.00
41467 2018-12 51 2018-Week-51 2018-12-20 3 Thursday 0.00
104422 2019-02 9 2019-Week-09 2019-02-26 1 Tuesday 0.00
description stock_code_description
41456 RETRO "TEA FOR ONE" 37333__RETRO "TEA FOR ONE"
193212 CHILDS GARDEN SPADE BLUE 22514__CHILDS GARDEN SPADE BLUE
313646 FRENCH BLUE METAL DOOR SIGN 4 22679__FRENCH BLUE METAL DOOR SIGN 4
41467 FRENCH BLUE METAL DOOR SIGN 4 22679__FRENCH BLUE METAL DOOR SIGN 4
104422 SET OF 16 VINTAGE BLACK CUTLERY 84968E__SET OF 16 VINTAGE BLACK CUTLERY
======================================================================================================================================================
# checking distribution of quantity in entries with zero unit prices.
distribution_IQR(zero_unit_price_entries, parameter='quantity', x_limits=[0, 30], title_extension='', bins=[3000, 12000], outliers_info=False)
==================================================
Statistics on quantity
in zero_unit_price_entries
count 412.00
mean 41.39
std 618.59
min 1.00
25% 1.00
50% 1.00
75% 3.00
max 12540.00
Name: quantity, dtype: float64
--------------------------------------------------
The distribution is extremely skewed to the right
(skewness: 20.2)
Note: outliers affect skewness calculation
==================================================
# checking entries of the main quantity outliers associated with zero price units
zero_unit_price_entries.query('quantity > 1000')
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
502122 | 578841 | 84826 | ASSTD DESIGN 3D PAPER STICKERS | 12540 | 2019-11-23 15:57:00 | 0.00 | 13256 | 2019 | 11 | 2019-11 | 47 | 2019-Week-47 | 2019-11-23 | 5 | Saturday | 0.00 | ASSTD DESIGN 3D PAPER STICKERS | 84826__ASSTD DESIGN 3D PAPER STICKERS |
Above we checked the data in the already cleaned df_ecom_filtered DataFrame. However, we mentioned earlier that there are many operational entries, already cleaned out, that affect quantity but not revenue. To make sure we understand the nature of all zero-price entries correctly, let’s also check zero-price entries in the initial df_ecom DataFrame.
# checking zero price entries in the initial `df_ecom` DataFrame
df_ecom.query('unit_price == 0')['description'].value_counts()
df_ecom.query('unit_price == 0').sample(5, random_state=7)
description
check 159
? 47
damages 45
damaged 43
found 25
...
HEART GARLAND RUSTIC PADDED 1
CHICK GREY HOT WATER BOTTLE 1
mystery! Only ever imported 1800 1
MERCHANT CHANDLER CREDIT ERROR, STO 1
lost 1
Name: count, Length: 376, dtype: int64
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
338957 | 566573 | 22823 | test | -22 | 2019-09-11 13:10:00 | 0.00 | 0 | 2019 | 9 | 2019-09 | 37 | 2019-Week-37 | 2019-09-11 | 2 | Wednesday | -0.00 |
14363 | 537534 | 22428 | ENAMEL FIRE BUCKET CREAM | 3 | 2018-12-05 11:48:00 | 0.00 | 0 | 2018 | 12 | 2018-12 | 49 | 2018-Week-49 | 2018-12-05 | 2 | Wednesday | 0.00 |
14383 | 537534 | 22202 | MILK PAN PINK POLKADOT | 2 | 2018-12-05 11:48:00 | 0.00 | 0 | 2018 | 12 | 2018-12 | 49 | 2018-Week-49 | 2018-12-05 | 2 | Wednesday | 0.00 |
344884 | 567125 | 21246 | damaged | -2 | 2019-09-14 13:49:00 | 0.00 | 0 | 2019 | 9 | 2019-09 | 37 | 2019-Week-37 | 2019-09-14 | 5 | Saturday | -0.00 |
436421 | 574123 | 22652 | check | -111 | 2019-11-01 10:55:00 | 0.00 | 0 | 2019 | 11 | 2019-11 | 44 | 2019-Week-44 | 2019-11-01 | 4 | Friday | -0.00 |
Observations and Decisions
Implementation of Decisions
# cleaning out zero unit price entries from df_ecom_filtered
operation = lambda df: df.query('unit_price != 0')
df_ecom_filtered = data_reduction(df_ecom_filtered, operation)
Number of entries cleaned out from the "df_ecom_filtered": 412 (0.1%)
In this section, we will analyze high-priced items in three ways: top-price purchases (entries whose unit_price falls within the top 5% of the price range across all entries), expensive products (those whose median unit price falls within the top 5% of all products’ median unit prices), and the most expensive products among them.
Note: Given the quite substantial (~33%) average coefficient of variation of unit price among products, top-price entries are likely not the same as the entries of expensive products, so we study them separately.
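A toy illustration of that note with hypothetical data: a product can be expensive by its median price while none of its entries clear the entry-level top-price threshold, and a cheap product can still contribute a top-price entry.
# toy example (hypothetical data): top-price entries vs. expensive products
import pandas as pd

toy = pd.DataFrame({'stock_code_description': ['LAMP'] * 4 + ['MUG'] * 16,
                    'unit_price': [12.0, 12.0, 12.0, 6.0] + [2.5] * 15 + [13.0]})
threshold = toy['unit_price'].quantile(0.95)                  # entry-level top 5% threshold (~12.05 here)
top_price_entries_toy = toy.query('unit_price > @threshold')  # only the single 13.0 MUG entry
medians = toy.groupby('stock_code_description')['unit_price'].median()
print(medians)  # LAMP median 12.0 (an "expensive" product), MUG median 2.5 - yet only MUG has a top-price entry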
Top-Price Purchases
# checking top-price purchases - top 5% by unit_price
top_price_threshold = np.percentile(df_ecom_filtered['unit_price'], 95)
top_price_entries = df_ecom_filtered.query('unit_price > @top_price_threshold').sort_values(by='unit_price', ascending=False)
print('='*115)
print(f'\033[1mWe consider top-price purchases as entries with unit price above {top_price_threshold :.0f} (top 5% of unit price range across all entries)\033[0m')
print('='*115)
===================================================================================================================
We consider top-price purchases as entries with unit price above 10 (top 5% of unit price range across all entries)
===================================================================================================================
# checking the share of entries with `unit_price` above the upper limit (top 5%)
top_price_entries = df_ecom_filtered.query('unit_price > @top_price_threshold')

share_evaluation(top_price_entries, df_ecom_filtered, show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_outliers=True,
                 show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: top_price_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 22422 (4.3% of all entries)
Quantity: 58464 (1.1% of the total quantity)
Revenue: 828158.8 (8.3% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Pie chart notes: as in the charts above, shares are measured relative to df_ecom_filtered, and an order, product, or customer counts as one full unit in the coverage charts even if only part of its entries falls into top_price_entries.
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
127825 547248 22654 DELUXE SEWING KIT 1 2019-03-20 09:23:00 11.63 0 2019 3
123540 546896 22649 STRAWBERRY FAIRY CAKE TEAPOT 1 2019-03-15 18:24:00 10.79 0 2019 3
202098 554362 22849 BREAD BIN DINER STYLE MINT 4 2019-05-22 10:17:00 14.95 17811 2019 5
24403 538349 21534 DAIRY MAID LARGE MILK JUG 1 2018-12-08 14:59:00 10.17 0 2018 12
174713 551844 23009 I LOVE LONDON BABY GIFT SET 1 2019-05-02 14:03:00 16.95 14173 2019 5
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue description \
127825 2019-03 12 2019-Week-12 2019-03-20 2 Wednesday 11.63 DELUXE SEWING KIT
123540 2019-03 11 2019-Week-11 2019-03-15 4 Friday 10.79 STRAWBERRY FAIRY CAKE TEAPOT
202098 2019-05 21 2019-Week-21 2019-05-22 2 Wednesday 59.80 BREAD BIN DINER STYLE MINT
24403 2018-12 49 2018-Week-49 2018-12-08 5 Saturday 10.17 DAIRY MAID LARGE MILK JUG
174713 2019-05 18 2019-Week-18 2019-05-02 3 Thursday 16.95 I LOVE LONDON BABY GIFT SET
stock_code_description
127825 22654__DELUXE SEWING KIT
123540 22649__STRAWBERRY FAIRY CAKE TEAPOT
202098 22849__BREAD BIN DINER STYLE MINT
24403 21534__DAIRY MAID LARGE MILK JUG
174713 23009__I LOVE LONDON BABY GIFT SET
======================================================================================================================================================
Let’s examine how often customers repeat purchases of expensive products. Our approach will be similar to the wholesale purchases study: we will group the top-price entries by product and calculate the unique_invoices_per_customer_avg metric. Since ~25% of entries have unknown customers (customer_id "0"), we will filter them out, as they would distort the calculation (otherwise, all unknown customers would act as one unique customer).
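As a reminder of what the metric means, here is a toy example with hypothetical data: for each product we divide the number of unique invoices by the number of unique customers, so values above 1 indicate repeat purchases.
# toy example: unique_invoices_per_customer_avg for a single hypothetical product
import pandas as pd

toy = pd.DataFrame({'stock_code_description': ['CLOCK'] * 4,
                    'invoice_no': ['1001', '1002', '1003', '1004'],
                    'customer_id': ['A', 'A', 'A', 'B']})
per_product = toy.groupby('stock_code_description').agg(unique_invoices=('invoice_no', 'nunique'),
                                                        unique_customers=('customer_id', 'nunique'))
per_product['unique_invoices_per_customer_avg'] = per_product['unique_invoices'] / per_product['unique_customers']
print(per_product)  # 4 invoices / 2 customers -> 2.0 invoices per customer on average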
# aggregating data by product
top_price_entries_products_summary = (top_price_entries.query('customer_id != "0"')
                                      .groupby(['stock_code_description'])
                                      .agg({'quantity': 'sum',
                                            'revenue': 'sum',
                                            'invoice_no': 'nunique',
                                            'customer_id': 'nunique'})
                                      .reset_index()
                                      ).round(1)

top_price_entries_products_summary.columns = ['stock_code_description',
                                              'quantity',
                                              'revenue',
                                              'unique_invoices',
                                              'unique_customers']

top_price_entries_products_summary['unique_invoices_per_customer_avg'] = round(
    top_price_entries_products_summary['unique_invoices'] / top_price_entries_products_summary['unique_customers'], 2)

# checking the results
print('='*table_width)
print(f'\033[1mDataFrame `top_price_entries_products_summary`:\033[0m')
top_price_entries_products_summary
print('-'*table_width)
print(f'\033[1mDescriptive statistics on top-price purchases (with prices in the top 5% of the price range) grouped by product:\033[0m')
top_price_entries_products_summary[['unique_customers', 'unique_invoices_per_customer_avg']].describe()
print('='*table_width)
======================================================================================================================================================
DataFrame `top_price_entries_products_summary`:
stock_code_description | quantity | revenue | unique_invoices | unique_customers | unique_invoices_per_customer_avg | |
---|---|---|---|---|---|---|
0 | 15056BL__EDWARDIAN PARASOL BLACK | 2 | 24.90 | 2 | 1 | 2.00 |
1 | 15056N__EDWARDIAN PARASOL NATURAL | 1 | 12.50 | 1 | 1 | 1.00 |
2 | 15056P__EDWARDIAN PARASOL PINK | 1 | 12.50 | 1 | 1 | 1.00 |
3 | 20679__EDWARDIAN PARASOL RED | 2 | 24.90 | 2 | 1 | 2.00 |
4 | 20685__DOORMAT RED RETROSPOT | 2 | 31.60 | 2 | 1 | 2.00 |
... | ... | ... | ... | ... | ... | ... |
281 | 90178A__AMBER CHUNKY GLASS+BEAD NECK... | 6 | 71.70 | 6 | 6 | 1.00 |
282 | 90178B__PURPLE CHUNKY GLASS+BEAD NEC... | 1 | 12.00 | 1 | 1 | 1.00 |
283 | 90191__SILVER LARIAT 40CM | 5 | 63.80 | 4 | 4 | 1.00 |
284 | 90196A__PURPLE GEMSTONE NECKLACE 45CM | 8 | 102.00 | 5 | 5 | 1.00 |
285 | 90196B__BLACK GEMSTONE NECKLACE 45CM | 4 | 51.00 | 4 | 4 | 1.00 |
286 rows × 6 columns
------------------------------------------------------------------------------------------------------------------------------------------------------
Descriptive statistics on top-price purchases (with prices in the top 5% of the price range) grouped by product:
unique_customers | unique_invoices_per_customer_avg | |
---|---|---|
count | 286.00 | 286.00 |
mean | 23.38 | 2.03 |
std | 64.47 | 2.00 |
min | 1.00 | 1.00 |
25% | 1.00 | 1.00 |
50% | 1.00 | 1.15 |
75% | 20.75 | 2.00 |
max | 880.00 | 12.00 |
======================================================================================================================================================
Expensive Products
Let’s define expensive products as those whose median unit price falls within the top 5% of all products’ median unit prices, where the median is calculated across all entries for each product.
Given the highly skewed unit_price distribution, we will start by calculating each product’s median price (the median represents a typical value better than the mean for non-normal distributions) along with other key metrics.
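A quick illustration of why the median is preferred here, using hypothetical prices: a single unusually high price pulls the mean far above the typical price, while the median stays put.
# toy example (hypothetical prices): median vs. mean for a skewed price series
import pandas as pd

prices = pd.Series([4.95, 4.95, 4.95, 5.95, 49.50])
print(prices.mean())    # 14.06 - distorted by the single high price
print(prices.median())  # 4.95  - the typical price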
# aggregating data by stock_code_description
products_summary = (
    df_ecom_filtered.groupby('stock_code_description')
    .agg({'unit_price': 'median',
          'quantity': 'sum',
          'revenue': 'sum',
          'invoice_no': 'nunique'})
    .reset_index()
    .sort_values(by='unit_price', ascending=False)
    .rename(columns={'invoice_no': 'unique_invoices', 'unit_price': 'unit_price_median'}))
products_summary
stock_code_description | unit_price_median | quantity | revenue | unique_invoices | |
---|---|---|---|---|---|
1695 | 22827__RUSTIC SEVENTEEN DRAWER SIDEB... | 165.00 | 35 | 5415.00 | 26 |
1696 | 22828__REGENCY MIRROR WITH SHUTTERS | 165.00 | 10 | 1530.00 | 7 |
1529 | 22655__VINTAGE RED KITCHEN CABINET | 125.00 | 60 | 8125.00 | 38 |
1530 | 22656__VINTAGE BLUE KITCHEN CABINET | 125.00 | 26 | 3685.00 | 20 |
1691 | 22823__CHEST NATURAL WOOD 20 DRAWERS | 125.00 | 24 | 2745.00 | 13 |
... | ... | ... | ... | ... | ... |
78 | 16259__PIECE OF CAMO STATIONERY SET | 0.08 | 3380 | 326.56 | 31 |
66 | 16216__LETTER SHAPE PENCIL SHARPENER | 0.06 | 3333 | 234.00 | 45 |
67 | 16218__CARTOON PENCIL SHARPENERS | 0.06 | 3821 | 283.31 | 64 |
39 | 16045__POPART WOODEN PENCILS ASST | 0.04 | 8900 | 380.00 | 68 |
3913 | PADS__PADS TO MATCH ALL CUSHIONS | 0.00 | 3 | 0.00 | 3 |
3919 rows × 5 columns
# calculating the top price threshold
products_unit_price_top_threshold = round(np.percentile(products_summary['unit_price_median'], 95), 2)
products_unit_price_top_threshold
9.95
# defining the most expensive products
expensive_products_summary = products_summary.query('unit_price_median > @products_unit_price_top_threshold')
expensive_products_list = expensive_products_summary['stock_code_description'].tolist()

# evaluating median unit prices
expensive_products_unit_price_median = expensive_products_summary['unit_price_median'].median()
general_unit_price_median = df_ecom_filtered['unit_price'].median()
print('='*116)
print(f'\033[1mWe consider expensive products as those with median unit price more than '
f'{products_unit_price_top_threshold:.2f}\033[0m (within the top 5% of the price range)\n'
f'\033[1mThe number of expensive products:\033[0m {len(expensive_products_summary)} ({len(expensive_products_summary) / len(products_summary) :0.1%} of the product range)\n'
f'\033[1mThe median unit price of expensive products:\033[0m {expensive_products_unit_price_median :0.1f} '
f'({expensive_products_unit_price_median / general_unit_price_median :0.1f} times higher than that of an average product ({general_unit_price_median :0.1f}))')
print('='*116)
====================================================================================================================
We consider expensive products as those with median unit price more than 9.95 (within the top 5% of the price range)
The number of expensive products: 177 (4.5% of the product range)
The median unit price of expensive products: 14.9 (7.2 times higher than that of an average product (2.1))
====================================================================================================================
# checking the share of entries of the most expensive products
expensive_products_entries = df_ecom_filtered.query('stock_code_description in @expensive_products_list')

share_evaluation(expensive_products_entries, df_ecom_filtered, show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_outliers=True,
                 show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: expensive_products_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 12130 (2.3% of all entries)
Quantity: 43718 (0.8% of the total quantity)
Revenue: 601511.2 (6.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Pie chart notes: as in the charts above, shares are measured relative to df_ecom_filtered, and an order, product, or customer counts as one full unit in the coverage charts even if only part of its entries falls into expensive_products_entries.
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
96616 544572 22839 3 TIER CAKE TIN GREEN AND CREAM 1 2019-02-19 13:21:00 14.95 14639 2019 2
273445 560833 23010 CIRCUS PARADE BABY GIFT SET 1 2019-07-19 12:14:00 16.95 16891 2019 7
62267 541497 84968A SET OF 16 VINTAGE ROSE CUTLERY 1 2019-01-16 15:19:00 8.29 0 2019 1
89363 543901 22509 SEWING BOX RETROSPOT DESIGN 2 2019-02-12 12:13:00 16.95 17659 2019 2
197964 553946 23111 PARISIENNE SEWING BOX 1 2019-05-18 10:48:00 12.50 15601 2019 5
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
96616 2019-02 8 2019-Week-08 2019-02-19 1 Tuesday 14.95
273445 2019-07 29 2019-Week-29 2019-07-19 4 Friday 16.95
62267 2019-01 3 2019-Week-03 2019-01-16 2 Wednesday 8.29
89363 2019-02 7 2019-Week-07 2019-02-12 1 Tuesday 33.90
197964 2019-05 20 2019-Week-20 2019-05-18 5 Saturday 12.50
description stock_code_description
96616 3 TIER CAKE TIN GREEN AND CREAM 22839__3 TIER CAKE TIN GREEN AND CREAM
273445 CIRCUS PARADE BABY GIFT SET 23010__CIRCUS PARADE BABY GIFT SET
62267 SET OF 16 VINTAGE ROSE CUTLERY 84968A__SET OF 16 VINTAGE ROSE CUTLERY
89363 SEWING BOX RETROSPOT DESIGN 22509__SEWING BOX RETROSPOT DESIGN
197964 PARISIENNE SEWING BOX 23111__PARISIENNE SEWING BOX
======================================================================================================================================================
Let’s create visualizations of price distributions for randomly selected expensive products. These graphs can often provide more insight than descriptive statistics alone.
# checking unit price distribution for top expensive products
plot_totals_distribution(expensive_products_entries, 'stock_code_description', 'unit_price', title_extension='among expensive products', sample_type='sample', random_state=7, n_items=20, show_outliers=False, plot_totals=False)
Most Expensive Products
In the next step we will study the most significant top-priced products. To do so, we will first filter out rarely purchased products and those with only a minor number of items sold: let’s exclude products whose total volume sold and total number of orders fall below the 25th percentile of these metrics.
products_quantity_25_percentile = np.percentile(products_summary['quantity'], 25)
products_invoices_25_percentile = np.percentile(products_summary['unique_invoices'], 25)
print('='*53)
print(f'\033[1m25th percentile of overall quantity per product:\033[0m {products_quantity_25_percentile:.1f}')
print(f'\033[1m25th percentile of orders per product:\033[0m {products_invoices_25_percentile:.1f}')
print('='*53)
=====================================================
25th percentile of overall quantity per product: 54.0
25th percentile of orders per product: 16.0
=====================================================
# filtering out unpopular products
expensive_products_summary_popular = expensive_products_summary.query('quantity >= @products_quantity_25_percentile and unique_invoices >= @products_invoices_25_percentile')
print('='*66)
print(f'\033[1mTotal expensive products:\033[0m {len(expensive_products_summary)}')
print(f'\033[1mPopular expensive products:\033[0m {len(expensive_products_summary_popular)} '
f'({len(expensive_products_summary_popular)/len(expensive_products_summary) * 100:.1f}% of total expensive products)')
print('='*66)
==================================================================
Total expensive products: 177
Popular expensive products: 88 (49.7% of total expensive products)
==================================================================
# defining the top 10 most expensive products and associated entries
top_10_expensive_summary = expensive_products_summary_popular.sort_values(by='unit_price_median').head(10)
top_10_expensive_list = top_10_expensive_summary['stock_code_description'].to_list()

print('='*45)
print(f'\033[1mTop 10 most expensive products:\033[0m')
top_10_expensive_list
print('='*45)
=============================================
Top 10 most expensive products:
['23085__ANTIQUE SILVER BAUBLE LAMP',
'23142__IVORY WIRE KITCHEN ORGANISER',
'47570B__TEA TIME TABLE CLOTH',
'22832__BROCANTE SHELF WITH HOOKS',
'15058C__ICE CREAM DESIGN GARDEN PARASOL',
'15058B__PINK POLKADOT GARDEN PARASOL',
'22165__DIAMANTE HEART SHAPED WALL MIRROR,',
'22461__SAVOY ART DECO CLOCK',
'85163B__BLACK BAROQUE WALL CLOCK',
'21843__RED RETROSPOT CAKE STAND']
=============================================
# checking the share of the top 10 most expensive products and associated entries
top_10_expensive_products_entries = df_ecom_filtered.query('stock_code_description in @top_10_expensive_list')

share_evaluation(top_10_expensive_products_entries, df_ecom_filtered, show_qty_rev=True, show_period=False,
                 show_boxplots=True, boxplots_parameter='stock_code_description', show_outliers=False,
                 show_example=False, example_type='sample', example_limit=3)
======================================================================================================================================================
Evaluation of share: top_10_expensive_products_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 1177 (0.2% of all entries)
Quantity: 4062 (0.1% of the total quantity)
Revenue: 38548.0 (0.4% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
======================================================================================================================================================
We won’t visualize the main metrics of the top 10 most expensive products (unlike our approach for the top 10 high-volume customers), given their minor impact on the overall analysis.
Observations
Let’s also check the value of cheap products (those with a median unit price in the bottom 5% of all products’ median unit prices).
# calculating the bottom price threshold
products_unit_price_bottom_threshold = round(np.percentile(products_summary['unit_price_median'], 5), 2)
products_unit_price_bottom_threshold
0.39
# defining the cheapest products
cheap_products_summary = products_summary.query('unit_price_median < @products_unit_price_bottom_threshold')
cheap_products_list = cheap_products_summary['stock_code_description'].tolist()

# evaluating median unit prices
cheap_products_unit_price_median = cheap_products_summary['unit_price_median'].median()
print('='*116)
print(f'\033[1mWe consider cheap products as those with median unit price lower than '
f'{products_unit_price_bottom_threshold:.2f}\033[0m (within the bottom 5% of the price range)\n'
f'\033[1mThe number of cheap products:\033[0m {len(cheap_products_list)} ({len(cheap_products_summary) / len(products_summary) :0.1%} of the product range)\n'
f'\033[1mThe median unit price of cheap products:\033[0m {cheap_products_unit_price_median :0.1f} '
f'({general_unit_price_median / cheap_products_unit_price_median :0.1f} times lower than that of an average product ({general_unit_price_median :0.1f}))')
print('='*116)
====================================================================================================================
We consider cheap products as those with median unit price lower than 0.39 (within the bottom 5% of the price range)
The number of cheap products: 134 (3.4% of the product range)
The median unit price of cheap products: 0.2 (9.9 times lower than that of an average product (2.1))
====================================================================================================================
# checking the share of entries associated with cheap products
cheap_products_entries = df_ecom_filtered.query('stock_code_description in @cheap_products_list')

share_evaluation(cheap_products_entries, df_ecom_filtered, show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_outliers=True,
                 show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: cheap_products_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 9603 (1.8% of all entries)
Quantity: 327021 (6.0% of the total quantity)
Revenue: 81576.0 (0.8% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Pie chart notes: as in the charts above, shares are measured relative to df_ecom_filtered, and an order, product, or customer counts as one full unit in the coverage charts even if only part of its entries falls into cheap_products_entries.
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year \
485360 577598 20668 DISCO BALL CHRISTMAS DECORATION 24 2019-11-19 08:19:00 0.12 13430 2019
498684 578532 85111 SILVER GLITTER FLOWER VOTIVE HOLDER 36 2019-11-22 14:40:00 0.29 18130 2019
273397 560828 23187 FRENCH STYLE STORAGE JAR BONBONS 48 2019-07-19 11:55:00 0.29 14298 2019
63382 541567 22616 PACK OF 12 LONDON TISSUES 24 2019-01-17 11:51:00 0.29 12681 2019
142380 548610 84926D LA PALMIERA TILE COASTER 4 2019-03-30 11:28:00 1.25 15860 2019
invoice_month invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
485360 11 2019-11 47 2019-Week-47 2019-11-19 1 Tuesday 2.88
498684 11 2019-11 47 2019-Week-47 2019-11-22 4 Friday 10.44
273397 7 2019-07 29 2019-Week-29 2019-07-19 4 Friday 13.92
63382 1 2019-01 3 2019-Week-03 2019-01-17 3 Thursday 6.96
142380 3 2019-03 13 2019-Week-13 2019-03-30 5 Saturday 5.00
description stock_code_description
485360 DISCO BALL CHRISTMAS DECORATION 20668__DISCO BALL CHRISTMAS DECORATION
498684 SILVER GLITTER FLOWER VOTIVE HOLDER 85111__SILVER GLITTER FLOWER VOTIVE ...
273397 FRENCH STYLE STORAGE JAR BONBONS 23187__FRENCH STYLE STORAGE JAR BONBONS
63382 PACK OF 12 LONDON TISSUES 22616__PACK OF 12 LONDON TISSUES
142380 LA PALMIERA TILE COASTER 84926D__LA PALMIERA TILE COASTER
======================================================================================================================================================
Let’s create visualizations of price distributions for randomly selected cheap products.
# checking unit price distribution for randomly selected cheap products
plot_totals_distribution(cheap_products_entries, 'stock_code_description', 'unit_price', title_extension='among cheap products', sample_type='sample', random_state=7, n_items=20, show_outliers=False, plot_totals=False)
Observations
To save time, we will base the revenue study on the already cleaned data and focus the analysis on the revenue distribution and its main outliers.
We already covered a significant portion of the revenue-related cleaning while examining quantity (for instance, when investigating mutually exclusive entries, various non-product operations, and wholesale purchases, which affect both quantity and revenue). This allows a more compact review of revenue in this part of the study.
# checking outliers with IQR approach + descriptive statistics
distribution_IQR(df=df_ecom_filtered, parameter='revenue', x_limits=[0, 75], title_extension='', bins=[1500, 6000])
Note: A sample data slice 2% of "df_ecom_filtered" was used for histogram plotting instead of the full DataFrame.
This significantly reduced plotting time for the large dataset. The accuracy of the visualization might be slightly reduced, meanwhile it should be sufficient for exploratory analysis.
==================================================
Statistics on revenue
in df_ecom_filtered
count 522565.00
mean 19.06
std 65.30
min 0.00
25% 3.90
50% 9.90
75% 17.70
max 7144.72
Name: revenue, dtype: float64
--------------------------------------------------
The distribution is extremely skewed to the right
(skewness: 31.6)
Note: outliers affect skewness calculation
--------------------------------------------------
Min border: -17
Max border: 39
--------------------------------------------------
The outliers are considered to be values above 39
We have 40703 values that we can consider outliers
Which makes 7.8% of the total "revenue" data
==================================================
# checking outliers with the percentile approach
percentile_outliers(df_ecom_filtered, parameter='revenue', lower_percentile=3, upper_percentile=97, print_limits=True, frame_len=100)
==============================================================================================================
Data on revenue
outliers based on the “percentile approach”
The outliers are considered to be values below 0.84 and above 82.8
We have 30350 values that we can consider outliers
Which makes 5.8% of the total "revenue" data
--------------------------------------------------------------------------------------------------------------
Limits: {'df_ecom_filtered_revenue_lower_limit': 0.84, 'df_ecom_filtered_revenue_upper_limit': 82.8}
==============================================================================================================
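The next query references df_ecom_filtered_revenue_upper_limit (and later its lower counterpart) as plain variables. Presumably percentile_outliers creates them; its internals are not shown here. If the helper only returns the dictionary printed above, the variables could be unpacked along these lines (an assumed sketch):
# assumed: percentile_outliers returns the limits dictionary shown above
limits = percentile_outliers(df_ecom_filtered, parameter='revenue',
                             lower_percentile=3, upper_percentile=97, print_limits=True, frame_len=100)
df_ecom_filtered_revenue_lower_limit = limits['df_ecom_filtered_revenue_lower_limit']
df_ecom_filtered_revenue_upper_limit = limits['df_ecom_filtered_revenue_upper_limit']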
# checking the share of entries with 'revenue' above the upper limit
top_revenue_outliers = df_ecom_filtered.query('revenue > @df_ecom_filtered_revenue_upper_limit')

share_evaluation(top_revenue_outliers, df_ecom_filtered, show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_outliers=True,
                 show_example=True, example_type='sample', example_limit=5)
======================================================================================================================================================
Evaluation of share: top_revenue_outliers
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 15686 (3.0% of all entries)
Quantity: 1670699 (30.9% of the total quantity)
Revenue: 3486877.6 (35.0% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Pie chart notes: as in the charts above, shares are measured relative to df_ecom_filtered, and an order, product, or customer counts as one full unit in the coverage charts even if only part of its entries falls into top_revenue_outliers.
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
350016 567610 20727 LUNCH BAG BLACK SKULL. 100 2019-09-19 11:30:00 1.45 17511 2019 9
96120 544477 21731 RED TOADSTOOL LED NIGHT LIGHT 144 2019-02-19 10:07:00 1.25 16029 2019 2
342951 566922 23355 HOT WATER BOTTLE KEEP CALM 24 2019-09-13 14:58:00 4.15 16156 2019 9
198020 553997 21937 STRAWBERRY PICNIC BAG 50 2019-05-18 11:34:00 2.55 12656 2019 5
96228 544480 21715 GIRLS VINTAGE TIN SEASIDE BUCKET 64 2019-02-19 10:32:00 2.10 14646 2019 2
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
350016 2019-09 38 2019-Week-38 2019-09-19 3 Thursday 145.00
96120 2019-02 8 2019-Week-08 2019-02-19 1 Tuesday 180.00
342951 2019-09 37 2019-Week-37 2019-09-13 4 Friday 99.60
198020 2019-05 20 2019-Week-20 2019-05-18 5 Saturday 127.50
96228 2019-02 8 2019-Week-08 2019-02-19 1 Tuesday 134.40
description stock_code_description
350016 LUNCH BAG BLACK SKULL. 20727__LUNCH BAG BLACK SKULL.
96120 RED TOADSTOOL LED NIGHT LIGHT 21731__RED TOADSTOOL LED NIGHT LIGHT
342951 HOT WATER BOTTLE KEEP CALM 23355__HOT WATER BOTTLE KEEP CALM
198020 STRAWBERRY PICNIC BAG 21937__STRAWBERRY PICNIC BAG
96228 GIRLS VINTAGE TIN SEASIDE BUCKET 21715__GIRLS VINTAGE TIN SEASIDE BUCKET
======================================================================================================================================================
# checking the most visually obvious outliers
share_evaluation(df_ecom_filtered.query('revenue > 6000'), df_ecom_filtered, show_qty_rev=True, show_example=True)
======================================================================================================================================================
Evaluation of share: the data slice mentioned in the call function
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 3 (0.0% of all entries)
Quantity: 7640 (0.1% of the total quantity)
Revenue: 20223.5 (0.2% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
Random examples:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year \
160546 550461 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2019-04-16 13:20:00 2.10 15749 2019
52711 540815 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2019-01-09 12:55:00 2.10 15749 2019
348325 567423 23243 SET OF TEA COFFEE SUGAR TINS PANTRY 1412 2019-09-18 11:05:00 5.06 17450 2019
invoice_month invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
160546 4 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday 6539.40
52711 1 2019-01 2 2019-Week-02 2019-01-09 2 Wednesday 6539.40
348325 9 2019-09 38 2019-Week-38 2019-09-18 2 Wednesday 7144.72
description stock_code_description
160546 FAIRY CAKE FLANNEL ASSORTED COLOUR 21108__FAIRY CAKE FLANNEL ASSORTED C...
52711 FAIRY CAKE FLANNEL ASSORTED COLOUR 21108__FAIRY CAKE FLANNEL ASSORTED C...
348325 SET OF TEA COFFEE SUGAR TINS PANTRY 23243__SET OF TEA COFFEE SUGAR TINS ...
======================================================================================================================================================
# checking the share of entries with revenue below the lower limit
bottom_revenue_outliers = df_ecom_filtered.query('revenue < @df_ecom_filtered_revenue_lower_limit')

share_evaluation(bottom_revenue_outliers, df_ecom_filtered, show_qty_rev=True, show_period=False,
                 show_example=False, example_type='head', example_limit=10, frame_len=75)
===========================================================================
Evaluation of share: bottom_revenue_outliers
in df_ecom_filtered
---------------------------------------------------------------------------
Number of entries: 14664 (2.8% of all entries)
Quantity: 16685 (0.3% of the total quantity)
Revenue: 9659.6 (0.1% of the total revenue)
===========================================================================
Observations
We define top-revenue purchases as entries with revenue in the top 5% across all entries.
# checking top-revenue purchases - top 5% by revenue
top_revenue_threshold = np.percentile(df_ecom_filtered['revenue'], 95)
top_revenue_purchases = df_ecom_filtered.query('revenue > @top_revenue_threshold').sort_values(by='revenue', ascending=False)
print('='*114)
print(f'\033[1mWe consider top-revenue purchases as those with revenue more than {top_revenue_threshold :.0f} (top 5% by revenue volume across all entries)\033[0m')
print('='*114)
==================================================================================================================
We consider top-revenue purchases as those with revenue more than 59 (top 5% by revenue volume across all entries)
==================================================================================================================
# checking the share of top-revenue purchases according to revenue amounts
share_evaluation(top_revenue_purchases, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_example=True, example_type='head', example_limit=3)
======================================================================================================================================================
Evaluation of share: top_revenue_purchases
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 26082 (5.0% of all entries)
Quantity: 2039607 (37.7% of the total quantity)
Revenue: 4206944.8 (42.2% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
[Pie charts: Quantity Share, Revenue Share, Entries Share, Invoices Coverage, Products Coverage, Customers Coverage of top_revenue_purchases within df_ecom_filtered]
Notes: the share charts show what portion of df_ecom_filtered falls into, is generated in, or occurs in top_revenue_purchases; every entry is counted separately, even if entries are associated with the same order. For the coverage charts, if even one entry of an order, product, or customer appears in top_revenue_purchases, it still counts as one full unique order, product, or customer.
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year \
348325 567423 23243 SET OF TEA COFFEE SUGAR TINS PANTRY 1412 2019-09-18 11:05:00 5.06 17450 2019
160546 550461 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2019-04-16 13:20:00 2.10 15749 2019
52711 540815 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2019-01-09 12:55:00 2.10 15749 2019
invoice_month invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue \
348325 9 2019-09 38 2019-Week-38 2019-09-18 2 Wednesday 7144.72
160546 4 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday 6539.40
52711 1 2019-01 2 2019-Week-02 2019-01-09 2 Wednesday 6539.40
description stock_code_description
348325 SET OF TEA COFFEE SUGAR TINS PANTRY 23243__SET OF TEA COFFEE SUGAR TINS ...
160546 FAIRY CAKE FLANNEL ASSORTED COLOUR 21108__FAIRY CAKE FLANNEL ASSORTED C...
52711 FAIRY CAKE FLANNEL ASSORTED COLOUR 21108__FAIRY CAKE FLANNEL ASSORTED C...
======================================================================================================================================================
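For reference, the coverage figures shown in the pie charts above can be approximated with simple nunique ratios. The following is a minimal sketch, assuming top_revenue_purchases and df_ecom_filtered as defined in this notebook; the share_evaluation helper may compute them differently internally.
# a minimal sketch of the coverage ratios illustrated in the pie charts above
for col, label in [('invoice_no', 'Invoices Coverage'),
                   ('stock_code_description', 'Products Coverage'),
                   ('customer_id', 'Customers Coverage')]:
    share = top_revenue_purchases[col].nunique() / df_ecom_filtered[col].nunique()
    print(f'{label}: {share:.1%}')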
# studying revenue distribution in top-revenue purchases
distribution_IQR(df=top_revenue_purchases, parameter='revenue', x_limits=[0, 150], bins=[2000, 6000], speed_up_plotting=True, target_sample=5000, outliers_info=False)
Note: A sample data slice 19% of "top_revenue_purchases" was used for histogram plotting instead of the full DataFrame.
This significantly reduced plotting time for the large dataset. The accuracy of the visualization might be slightly reduced, meanwhile it should be sufficient for exploratory analysis.
==================================================
Statistics on revenue
in top_revenue_purchases
count 26082.00
mean 161.30
std 249.32
min 59.40
25% 70.92
50% 99.00
75% 165.00
max 7144.72
Name: revenue, dtype: float64
--------------------------------------------------
The distribution is extremely skewed to the right
(skewness: 9.5)
Note: outliers affect skewness calculation
==================================================
We see that top-revenue purchases generate a similar share of quantity and revenue to wholesale purchases (30-40% of the totals for both metrics in both datasets). Let’s examine how many purchases the two datasets have in common.
# defining common entries among top-revenue purchases and wholesale purchases
common_entries = wholesale_purchases.index.intersection(top_revenue_purchases.index)
print(f'\033[1mThe `top_revenue_purchases` have {len(common_entries)/len(top_revenue_purchases) :0.1%} entries in common with the `wholesale_purchases`.\033[0m')
The `top_revenue_purchases` have 58.9% entries in common with the `wholesale_purchases`.
Observations
Top-revenue purchases, representing just ~5% of all entries, generate ~38% of the total quantity and ~42% of the total revenue.
The mean revenue of top-revenue purchases (~161) is significantly higher than the median (99), indicating a skewed distribution and the impact of major purchases.
Share of products with at least one top-revenue purchase: ~52%.
Share of customers who made at least one top-revenue purchase: ~46%, which is noticeably lower than for wholesale purchases (58%).
58.9% of top-revenue purchases overlap with wholesale purchases.
Let’s examine the customers with the highest purchase revenues. We define high-revenue customers as those whose purchase revenue falls within the top 5% of all customers. We already have the DataFrame summarizing the main parameters by customer; now we will define the top-revenue performers.
# calculating the top revenue threshold
high_revenue_customers_rev_threshold = round(np.percentile(customers_summary['revenue'], 95), 0)

# defining high-revenue customers - the top 5% by revenue
high_revenue_customers_summary = customers_summary.query('revenue > @high_revenue_customers_rev_threshold').sort_values(by='revenue', ascending=False)
high_revenue_customers_list = high_revenue_customers_summary['customer_id'].tolist()

high_revenue_customers_entries = df_ecom_filtered.query('customer_id in @high_revenue_customers_list')

print('='*131)
print(f'\033[1mWe consider high-revenue customers as those who generated more than {high_revenue_customers_rev_threshold:.0f} revenue in total (the top 5% of customers)\033[0m')
print('-'*131)
print()
print(f'\033[1mDescriptive statistics on purchases made by high-revenue customers:\033[0m')
display(high_revenue_customers_entries[['quantity', 'revenue']].describe())
print('='*131)
===================================================================================================================================
We consider high-revenue customers as those who generated more than 5722 revenue in total (the top 5% of customers)
-----------------------------------------------------------------------------------------------------------------------------------
Descriptive statistics on purchases made by high-revenue customers:
quantity | revenue | |
---|---|---|
count | 103721.00 | 103721.00 |
mean | 22.43 | 39.84 |
std | 69.85 | 128.26 |
min | 1.00 | 0.06 |
25% | 2.00 | 5.90 |
50% | 7.00 | 15.00 |
75% | 16.00 | 30.00 |
max | 4800.00 | 7144.72 |
===================================================================================================================================
# checking the share of purchases made by high-revenue customers
share_evaluation(high_revenue_customers_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_example=True, example_type='head', example_limit=5)
======================================================================================================================================================
Evaluation of share: high_revenue_customers_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 103721 (19.8% of all entries)
Quantity: 2325988 (43.0% of the total quantity)
Revenue: 4132103.0 (41.5% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
[Pie charts: Quantity Share, Revenue Share, Entries Share, Invoices Coverage, Products Coverage, Customers Coverage of high_revenue_customers_entries within df_ecom_filtered]
Notes: the share charts show what portion of df_ecom_filtered falls into, is generated in, or occurs in high_revenue_customers_entries; every entry is counted separately, even if entries are associated with the same order. For the coverage charts, if even one entry of an order, product, or customer appears in high_revenue_customers_entries, it still counts as one full unique order, product, or customer.
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code initial_description quantity invoice_date unit_price customer_id invoice_year invoice_month \
26 536370 22728 ALARM CLOCK BAKELIKE PINK 24 2018-11-29 08:45:00 3.75 12583 2018 11
27 536370 22727 ALARM CLOCK BAKELIKE RED 24 2018-11-29 08:45:00 3.75 12583 2018 11
28 536370 22726 ALARM CLOCK BAKELIKE GREEN 12 2018-11-29 08:45:00 3.75 12583 2018 11
29 536370 21724 PANDA AND BUNNIES STICKER SHEET 12 2018-11-29 08:45:00 0.85 12583 2018 11
30 536370 21883 STARS GIFT TAPE 24 2018-11-29 08:45:00 0.65 12583 2018 11
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue description \
26 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 90.00 ALARM CLOCK BAKELIKE PINK
27 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 90.00 ALARM CLOCK BAKELIKE RED
28 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 45.00 ALARM CLOCK BAKELIKE GREEN
29 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 10.20 PANDA AND BUNNIES STICKER SHEET
30 2018-11 48 2018-Week-48 2018-11-29 3 Thursday 15.60 STARS GIFT TAPE
stock_code_description
26 22728__ALARM CLOCK BAKELIKE PINK
27 22727__ALARM CLOCK BAKELIKE RED
28 22726__ALARM CLOCK BAKELIKE GREEN
29 21724__PANDA AND BUNNIES STICKER SHEET
30 21883__STARS GIFT TAPE
======================================================================================================================================================
Let’s also examine how many of the top contributing high-revenue and high-volume customers are the same by comparing the top 20 of each group. We will also display the revenue totals and distributions of the top high-revenue customers.
# getting a list of the top 20 revenue-generating customers
top_20_high_revenue_customers = set(high_revenue_customers_summary.sort_values(by='quantity', ascending=False).head(20)['customer_id'])

common_customers_revenue = top_20_high_revenue_customers.intersection(top_20_high_revenue_customers)
number_of_common_customers = len(common_customers_revenue)
share_of_common_customers = number_of_common_customers / 20
print('='*115)
print(f'\033[1mShare of common customers among the top high-revenue customers and the top business customers:\033[0m {share_of_common_customers :0.1%} ({number_of_common_customers} out of 20)')
print('='*115)
===================================================================================================================
Share of common customers among the top high-revenue customers and the top business customers: 100.0% (20 out of 20)
===================================================================================================================
Let’s display the revenue totals and distributions of the top high-revenue customers.
plot_totals_distribution(high_revenue_customers_entries, 'customer_id', 'revenue', n_items=20, show_outliers=True)
There are 8 out of 20 customers in common between the top high-revenue customers and the top business customers, which makes 40% of them. We also see that there are very evident leaders among the top business customers, and it looks like the share of quantity associated with these common customers is much larger than those 40%. Let’s check it out.
common_top_20_revenue_customers_entries = df_ecom_filtered.query('customer_id in @common_customers_revenue')

share_evaluation(common_top_20_revenue_customers_entries, df_ecom_filtered,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True)
======================================================================================================================================================
Evaluation of share: common_top_20_revenue_customers_entries
in df_ecom_filtered
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 19664 (3.8% of all entries)
Quantity: 1107639 (20.5% of the total quantity)
Revenue: 1880824.6 (18.9% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
[Pie charts: Quantity Share, Revenue Share, Entries Share, Invoices Coverage, Products Coverage, Customers Coverage of common_top_20_revenue_customers_entries within df_ecom_filtered]
Notes: the share charts show what portion of df_ecom_filtered falls into, is generated in, or occurs in common_top_20_revenue_customers_entries; every entry is counted separately, even if entries are associated with the same order. For the coverage charts, if even one entry of an order, product, or customer appears in common_top_20_revenue_customers_entries, it still counts as one full unique order, product, or customer.
======================================================================================================================================================
Observations
Note: A significant share of purchases is made by undefined customers: ~25% of all entries, ~8% of the total quantity, and ~15% of the total revenue.
In fact, we have already accomplished most of what we planned for Identifier Analysis within the Distribution Analysis, where it was needed. Here, we will conduct a brief additional review to keep this analysis concise.
invoice_no column
Checking atypical values in the invoice_no column in the original df_ecom DataFrame.
df_ecom_copy = df_ecom.copy()
df_ecom_copy['invoice_no_length'] = df_ecom_copy['invoice_no'].str.len()
df_ecom_copy['invoice_no_is_numeric'] = df_ecom_copy['invoice_no'].str.isnumeric()
non_numeric_share = (1 - df_ecom_copy['invoice_no_is_numeric'].mean())

print('='*table_width)
display(Markdown(f'**Analysis of the `invoice_no` column of the original `df_ecom` Dataframe**:\n'))
display(df_ecom_copy['invoice_no_length'].value_counts())
print()
display(df_ecom_copy['invoice_no_is_numeric'].value_counts().reset_index())

print('-'*table_width)
print(f'\033[1mShare of non-numeric values in the `invoice_no` column:\033[0m {non_numeric_share *100 :0.1f}%')
print(f'\n\033[1mSample entries with atypical number of letters in `invoice_no` column\033[0m:')
display(df_ecom_copy[df_ecom_copy['invoice_no_length'] != 6].sample(5, random_state=7))
print('='*table_width)
======================================================================================================================================================
Analysis of the invoice_no column of the original df_ecom Dataframe:
invoice_no_length
6 525933
7 9252
Name: count, dtype: int64
invoice_no_is_numeric | count | |
---|---|---|
0 | True | 525933 |
1 | False | 9252 |
------------------------------------------------------------------------------------------------------------------------------------------------------
Share of non-numeric values in the `invoice_no` column: 1.7%
Sample entries with atypical number of letters in `invoice_no` column:
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | invoice_no_length | invoice_no_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
152849 | C549692 | 21668 | RED STRIPE CERAMIC DRAWER KNOB | -1 | 2019-04-09 13:43:00 | 1.06 | 13668 | 2019 | 4 | 2019-04 | 15 | 2019-Week-15 | 2019-04-09 | 1 | Tuesday | -1.06 | 7 | False |
115510 | C546131 | 21539 | RED RETROSPOT BUTTER DISH | -1 | 2019-03-07 15:08:00 | 4.95 | 16057 | 2019 | 3 | 2019-03 | 10 | 2019-Week-10 | 2019-03-07 | 3 | Thursday | -4.95 | 7 | False |
242253 | C558327 | 21926 | RED/CREAM STRIPE CUSHION COVER | -12 | 2019-06-26 12:04:00 | 1.25 | 17900 | 2019 | 6 | 2019-06 | 26 | 2019-Week-26 | 2019-06-26 | 2 | Wednesday | -15.00 | 7 | False |
19390 | C537856 | 37370 | RETRO COFFEE MUGS ASSORTED | -2 | 2018-12-06 15:59:00 | 1.25 | 14388 | 2018 | 12 | 2018-12 | 49 | 2018-Week-49 | 2018-12-06 | 3 | Thursday | -2.50 | 7 | False |
191595 | C553378 | POST | POSTAGE | -1 | 2019-05-14 15:02:00 | 27.42 | 0 | 2019 | 5 | 2019-05 | 20 | 2019-Week-20 | 2019-05-14 | 1 | Tuesday | -27.42 | 7 | False |
======================================================================================================================================================
Checking atypical values in the invoice_no column in the filtered df_ecom_filtered DataFrame.
df_ecom_filtered_copy = df_ecom_filtered.copy()
df_ecom_filtered_copy['invoice_no_length'] = df_ecom_filtered_copy['invoice_no'].str.len()
df_ecom_filtered_copy['invoice_no_is_numeric'] = df_ecom_filtered['invoice_no'].str.isnumeric()
non_numeric_share_filtered = (1 - df_ecom_filtered_copy['invoice_no_is_numeric'].mean())

print('='*81)
display(Markdown(f'**Analysis of the `invoice_no` column of the filtered `df_ecom_filtered` Dataframe**:\n'))

display(df_ecom_filtered_copy['invoice_no_length'].value_counts().reset_index())
display(df_ecom_filtered_copy['invoice_no_is_numeric'].value_counts().reset_index())

print('-'*81)
print(f'\033[1mShare of non-numeric values in the `invoice_no` column:\033[0m {non_numeric_share_filtered *100 :0.1f}%')
print('='*81)
=================================================================================
Analysis of the invoice_no column of the filtered df_ecom_filtered Dataframe:
invoice_no_length | count | |
---|---|---|
0 | 6 | 522565 |
invoice_no_is_numeric | count | |
---|---|---|
0 | True | 522565 |
---------------------------------------------------------------------------------
Share of non-numeric values in the `invoice_no` column: 0.0%
=================================================================================
Observations
A comparative analysis of the invoice_no column in the original df_ecom DataFrame and the filtered df_ecom_filtered DataFrame reveals that we removed 9252 atypical invoice values (1.7% of the total) containing non-numeric characters. Our previous analysis shows that they were primarily associated with data corrections involving negative quantity entries.
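As a quick cross-check of that association, the share of negative-quantity entries among the non-numeric invoices can be computed directly; a minimal sketch, assuming df_ecom with the invoice_no and quantity columns used above:
# share of negative-quantity entries among invoices with non-numeric identifiers (sketch)
non_numeric_invoices = df_ecom[~df_ecom['invoice_no'].str.isnumeric()]
negative_share = (non_numeric_invoices['quantity'] < 0).mean()
print(f'Negative-quantity entries among non-numeric invoices: {negative_share:.1%}')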
stock_code column
Checking atypical values in the stock_code column of the original df_ecom DataFrame.
df_ecom_copy = df_ecom.copy()
df_ecom_copy['stock_code_length'] = df_ecom_copy['stock_code'].str.len()
df_ecom_copy['stock_code_is_numeric'] = df_ecom_copy['stock_code'].str.isnumeric()

print('='*67)
display(Markdown(f'**Analysis of the `stock_code` column of the original `df_ecom` Dataframe**:\n'))
display(df_ecom_copy['stock_code_length'].value_counts().reset_index())
display(df_ecom_copy['stock_code_is_numeric'].value_counts().reset_index())

print('-'*67)
non_numeric_share = (1 - df_ecom_copy['stock_code_is_numeric'].mean())
print(f'\033[1mShare of non-numeric values in the `stock_code` column:\033[0m {non_numeric_share *100 :0.1f}%')
print('='*67)
===================================================================
Analysis of the stock_code
column of the original df_ecom
Dataframe:
stock_code_length | count | |
---|---|---|
0 | 5 | 481110 |
1 | 6 | 50713 |
2 | 4 | 1272 |
3 | 3 | 709 |
4 | 1 | 707 |
5 | 7 | 390 |
6 | 2 | 143 |
7 | 12 | 69 |
8 | 9 | 47 |
9 | 8 | 25 |
stock_code_is_numeric | count | |
---|---|---|
0 | True | 481110 |
1 | False | 54075 |
-------------------------------------------------------------------
Share of non-numeric values in the `stock_code` column: 10.1%
===================================================================
Checking atypical values in the stock_code column in the filtered df_ecom_filtered DataFrame.
df_ecom_filtered_copy = df_ecom_filtered.copy()
df_ecom_filtered_copy['stock_code_length'] = df_ecom_filtered_copy['stock_code'].str.len()
df_ecom_filtered_copy['stock_code_is_numeric'] = df_ecom_filtered['stock_code'].str.isnumeric()

print('='*table_width)
display(Markdown(f'**Analysis of the `stock_code` column of the filtered `df_ecom_filtered` Dataframe**:\n'))
display(df_ecom_filtered_copy['stock_code_length'].value_counts().reset_index())
display(df_ecom_filtered_copy['stock_code_is_numeric'].value_counts().reset_index())

print('-'*table_width)
non_numeric_share = (1 - df_ecom_filtered_copy['stock_code_is_numeric'].mean())
print(f'\033[1mShare of non-numeric values in the `stock_code` column:\033[0m {non_numeric_share *100 :0.1f}%')
print('-'*table_width)

# checking examples of entries for stock codes with different lengths
for length in set(df_ecom_filtered_copy['stock_code_length']):
    print(f'\n\033[1mSample entries with stock code of length "{length}":\033[0m')
    display(df_ecom_filtered_copy[df_ecom_filtered_copy['stock_code_length'] == length].sample(1, random_state=7))
print('='*table_width)
======================================================================================================================================================
Analysis of the stock_code
column of the filtered df_ecom_filtered
Dataframe:
stock_code_length | count | |
---|---|---|
0 | 5 | 472247 |
1 | 6 | 49868 |
2 | 7 | 383 |
3 | 12 | 31 |
4 | 8 | 20 |
5 | 9 | 13 |
6 | 4 | 3 |
stock_code_is_numeric | count | |
---|---|---|
0 | True | 472247 |
1 | False | 50318 |
------------------------------------------------------------------------------------------------------------------------------------------------------
Share of non-numeric values in the `stock_code` column: 9.6%
------------------------------------------------------------------------------------------------------------------------------------------------------
Sample entries with stock code of length "4":
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | stock_code_length | stock_code_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
359871 | 568200 | PADS | PADS TO MATCH ALL CUSHIONS | 1 | 2019-09-23 14:58:00 | 0.00 | 16198 | 2019 | 9 | 2019-09 | 39 | 2019-Week-39 | 2019-09-23 | 0 | Monday | 0.00 | PADS TO MATCH ALL CUSHIONS | PADS__PADS TO MATCH ALL CUSHIONS | 4 | False |
Sample entries with stock code of length "5":
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | stock_code_length | stock_code_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
261601 | 559876 | 20719 | WOODLAND CHARLOTTE BAG | 1 | 2019-07-11 11:09:00 | 0.85 | 15752 | 2019 | 7 | 2019-07 | 28 | 2019-Week-28 | 2019-07-11 | 3 | Thursday | 0.85 | WOODLAND CHARLOTTE BAG | 20719__WOODLAND CHARLOTTE BAG | 5 | True |
Sample entries with stock code of length "6":
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | stock_code_length | stock_code_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
77859 | 542789 | 16156S | WRAP PINK FAIRY CAKES | 25 | 2019-01-30 10:38:00 | 0.42 | 17511 | 2019 | 1 | 2019-01 | 5 | 2019-Week-05 | 2019-01-30 | 2 | Wednesday | 10.50 | WRAP PINK FAIRY CAKES | 16156S__WRAP PINK FAIRY CAKES | 6 | False |
Sample entries with stock code of length "7":
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | stock_code_length | stock_code_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
348393 | 567426 | 15056BL | EDWARDIAN PARASOL BLACK | 6 | 2019-09-18 11:33:00 | 5.95 | 13767 | 2019 | 9 | 2019-09 | 38 | 2019-Week-38 | 2019-09-18 | 2 | Wednesday | 35.70 | EDWARDIAN PARASOL BLACK | 15056BL__EDWARDIAN PARASOL BLACK | 7 | False |
Sample entries with stock code of length "8":
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | stock_code_length | stock_code_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24906 | 538349 | DCGS0003 | BOXED GLASS ASHTRAY | 1 | 2018-12-08 14:59:00 | 2.51 | 0 | 2018 | 12 | 2018-12 | 49 | 2018-Week-49 | 2018-12-08 | 5 | Saturday | 2.51 | BOXED GLASS ASHTRAY | DCGS0003__BOXED GLASS ASHTRAY | 8 | False |
Sample entries with stock code of length "9":
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | stock_code_length | stock_code_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
278379 | 561209 | DCGSSGIRL | GIRLS PARTY BAG | 5 | 2019-07-23 16:57:00 | 1.25 | 0 | 2019 | 7 | 2019-07 | 30 | 2019-Week-30 | 2019-07-23 | 1 | Tuesday | 6.25 | GIRLS PARTY BAG | DCGSSGIRL__GIRLS PARTY BAG | 9 | False |
Sample entries with stock code of length "12":
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | stock_code_length | stock_code_is_numeric | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
44725 | 540238 | gift_0001_30 | Dotcomgiftshop Gift Voucher £30.00 | 1 | 2019-01-03 14:44:00 | 25.53 | 0 | 2019 | 1 | 2019-01 | 1 | 2019-Week-01 | 2019-01-03 | 3 | Thursday | 25.53 | Dotcomgiftshop Gift Voucher £30.00 | gift_0001_30__Dotcomgiftshop Gift Vo... | 12 | False |
======================================================================================================================================================
Observations
A comparative analysis of the stock_code column in the original df_ecom DataFrame and the filtered df_ecom_filtered DataFrame shows that we reduced the proportion of atypical non-numeric values in stock_code from 10.1% to 9.6%.
description column
To enhance the efficiency of our analysis, we will create a function called boxplots. This function will help us visualize the distributions and medians of parameters over time (thanks to another project, the function is already in place and requires only minor adjustments).
Function: boxplots
def boxplots(df, x_parameter, y_parameter, category=None, figsize=(10,5), title_extension='', color=None, palette='x_palette', order=True, notch=False, show_outliers=True):
"""
The function builds boxplots for each unique value of a selected category (if any is defined) in a given DataFrame.
The boxplot color is assigned based on unique values of the 'x_parameter' to allow for easier comparison.
As input, the function takes:
- df (DataFrame): the DataFrame on which boxplots are built.
- x_parameter (str): the column name to be used on the x-axis of the boxplot.
- y_parameter (str): the column name to be used on the y-axis of the boxplot.
- category (str, optional): a column that defines categories for creating separate boxplots for each category value (default is None).
- figsize (tuple, optional): The size of the figure (default is (10, 5)).
- title_extension (str, optional): additional text to be added to the title (default is empty string).
- color (str or list, optional): a specific color or list of colors to use for the boxplots. If None, colors are assigned according to the palette (default is None).
- palette (str, dict or list, optional): a custom color palette to use for the boxplots. If 'x_palette', creates a palette mapping x_parameter values to colors (default is 'x_palette').
- order (bool, optional): whether to sort boxplots by their medians in ascending order (default is True).
- notch (bool, optional): whether to show notches on boxplots to better indicate medians (default is False).
- show_outliers (bool, optional)): whether to show outliers in the boxplot (default is True).
As output, the function presents:
- Boxplots: one or more boxplots, depending on whether a category is provided. Each unique value of the x_parameter will have its own boxplot, with colors assigned for easier visual distinction.
- If no category is provided (category=None), a single boxplot will be displayed for all data in the DataFrame.
----------------
Note: If both 'color' and 'palette' are set, 'color' will be used for all boxplots and the 'palette' parameter will be ignored.
Example of usage (for creating boxplots of sales by platform):
boxplots(df_sales, x_parameter='platform', y_parameter='sales', category='region', show_outliers=False)
----------------
"""
if color != None:
= None
palette else:
# creating a dictionary to pair each x-axis parameter with its color
= {
x_palette
x_param: colorfor x_param, color in zip(df[x_parameter].unique(), sns.color_palette('tab20', n_colors=len(df[x_parameter].unique())))}
if palette == 'x_palette':
= x_palette
palette
# adjusting the title extension
if title_extension:
= f' {title_extension}'
title_extension
if category == None:
# checking conditions for sorting boxplots by their medians values
if order == True:
= df.groupby(x_parameter)[y_parameter].median().sort_values(ascending=False).index
boxplot_order else:
=None
boxplot_order
# plotting boxplot with relevant subtitles
= plt.subplots(figsize = figsize)
fig, ax f'Boxplot of \"{y_parameter}{title_extension}\" by \"{x_parameter}\"', fontsize = 16)
plt.suptitle(= df[x_parameter], y = df[y_parameter],\
sns.boxplot(x = dict(alpha=0.5), hue = category, order = boxplot_order,
boxprops = notch, showfliers = show_outliers, color = color, palette = palette)
notch =45)
plt.xticks(rotation
# removing a legend if any
if ax.get_legend() is not None:
ax.get_legend().remove()
else:
# preventing data overwriting while running the "for" circle
= df
df_basic
# going through all unique names of a selected category, the further code would be applied to each of them
for unique_name in df[category].unique():
# assigning data the boxplots will be built on
= df_basic[df_basic[category] == unique_name]
df
# checking conditions for sorting boxplots by their medians values
if order==True:
= df.groupby(x_parameter)[y_parameter].median().sort_values(ascending=False).index
boxplot_order else:
=None
boxplot_order
# plotting boxplot with relevant subtitles
= plt.subplots(figsize = figsize)
fig, ax f'Boxplot of \"{y_parameter}{title_extension}\" by \"{x_parameter}\" for the \"{unique_name} {category}\"', fontsize=16)
plt.suptitle(= df[x_parameter], y = df[y_parameter],\
sns.boxplot(x = dict(alpha = 0.5), order = boxplot_order,
boxprops = notch, showfliers = show_outliers, color=None, palette = palette)
notch =45) plt.xticks(rotation
Let’s recall that the overall period of the dataset is 2018-11-29 - 2019-12-07.
In the next step, we will filter our DataFrame so that it includes only entire calendar months. Since our calculations will be monthly based, partial data may mislead the model.
By covering a 12-month period, all seasonal fluctuations will be included.
# filtering out entries from incomplete months
df_ecom_filtered_12m = data_reduction(df_ecom_filtered, lambda df: df.query('invoice_year_month >= "2018-12" and invoice_year_month < "2019-12"'))
Number of entries cleaned out from the "df_ecom_filtered": 24234 (4.6%)
share_evaluation(df_ecom_filtered_12m, df_ecom_filtered, show_qty_rev=True, show_period=True)
==============================================================================================================
Evaluation of share: df_ecom_filtered_12m
in df_ecom_filtered
--------------------------------------------------------------------------------------------------------------
Number of entries: 498331 (95.4% of all entries)
Quantity: 5172014 (95.7% of the total quantity)
Revenue: 9517759.5 (95.5% of the total revenue)
--------------------------------------------------------------------------------------------------------------
Invoice period coverage: 2018-12-01 - 2019-11-30 (97.6%; 364 out of 373 total days; 12 out of 12 total months)
==============================================================================================================
Observations
Let’s create a DataFrame presenting a monthly summary.
# grouping and aggregating the data
monthly_summary = df_ecom_filtered_12m.groupby('invoice_year_month').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'invoice_no': 'nunique',
    'stock_code_description': ['count', 'nunique'],
    'customer_id': 'nunique',
    'unit_price': ['mean', 'median']}
    ).reset_index().sort_values('invoice_year_month')

monthly_summary.columns = ['invoice_year_month',
                           'revenue',
                           'quantity',
                           'unique_invoices',
                           'entries',
                           'unique_products',
                           'unique_customers',
                           'unit_price_mean', 'unit_price_median']
monthly_summary
invoice_year_month | revenue | quantity | unique_invoices | entries | unique_products | unique_customers | unit_price_mean | unit_price_median | |
---|---|---|---|---|---|---|---|---|---|
0 | 2018-12 | 670676.20 | 299461 | 1282 | 35788 | 2736 | 769 | 3.86 | 2.55 |
1 | 2019-01 | 641890.68 | 338021 | 1205 | 36781 | 2602 | 806 | 3.35 | 2.10 |
2 | 2019-02 | 502201.30 | 277862 | 1071 | 26089 | 2396 | 745 | 3.56 | 2.46 |
3 | 2019-03 | 671649.94 | 373897 | 1411 | 34278 | 2495 | 950 | 3.45 | 2.10 |
4 | 2019-04 | 497476.19 | 293019 | 1179 | 27993 | 2440 | 826 | 3.32 | 2.08 |
5 | 2019-05 | 784946.06 | 416382 | 1744 | 38227 | 2516 | 1080 | 3.49 | 2.10 |
6 | 2019-06 | 659034.58 | 370107 | 1476 | 33526 | 2580 | 972 | 3.29 | 2.08 |
7 | 2019-07 | 722230.94 | 419026 | 1487 | 39748 | 2692 | 970 | 3.06 | 1.95 |
8 | 2019-08 | 754086.87 | 439459 | 1404 | 35297 | 2589 | 940 | 3.14 | 2.08 |
9 | 2019-09 | 963129.03 | 530912 | 1705 | 46410 | 2717 | 1215 | 3.06 | 2.08 |
10 | 2019-10 | 1165477.67 | 656282 | 2131 | 61167 | 2861 | 1431 | 3.10 | 2.08 |
11 | 2019-11 | 1484959.99 | 757586 | 2831 | 83027 | 2931 | 1673 | 3.10 | 2.08 |
Let’s plot together revenue and quantity by month.
# creating a combined line plot of revenue and quantity
fig, ax1 = plt.subplots(figsize=(10, 5))
plt.title('Revenue and Quantity by Month', fontsize=16)

# plotting revenue data
color_1 = 'darkred'
ax1.set_xlabel('Year-Month')
ax1.set_ylabel('Revenue', color=color_1)

sns.lineplot(data=monthly_summary,
             x='invoice_year_month',
             y='revenue',
             marker='o',
             linewidth=2.5,
             markersize=9,
             color=color_1,
             ax=ax1)

ax1.tick_params(axis='x', rotation=45)
ax1.tick_params(axis='y', labelcolor=color_1)

# plotting quantity data
color_2 = 'teal'
ax2 = ax1.twinx()
ax2.set_ylabel('Quantity', color=color_2)

sns.lineplot(data=monthly_summary,
             x='invoice_year_month',
             y='quantity',
             marker='o',
             linewidth=2.5,
             markersize=9,
             color=color_2,
             ax=ax2)

ax2.tick_params(axis='y', labelcolor=color_2)

# using engineering notation instead of scientific
ax1.yaxis.set_major_formatter(EngFormatter()); ax2.yaxis.set_major_formatter(EngFormatter())
Observations
From June 2019 there is a strong stable rising trend in both revenue and quantity, peaking in November 2019. The most significant rise in revenue occurs between August 2019 and November 2019. During this period, the number of units sold and revenue almost doubled.
This could be due to factors such as a seasonal increase in customer demand (back-to-school preparation and major sales events), or other factors such as successful marketing campaigns during these months.
We see fluctuations in both revenue and quantity from December 2018 to May 2019, with noticeable recessions in February and April 2019.
The reasons may lie in factors such as seasonally low demand or external conditions impacting sales that are not yet obvious.
From December 2018 to January 2019 quantity was growing, while revenue was declining.
This could probably be explained by a decrease in the average prices of units customers bought in this period. We can investigate this aspect further.
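As a first step of that investigation, the implied average price per unit sold can be derived directly from the monthly totals; a minimal sketch, assuming monthly_summary as built above:
# implied average price per unit sold (revenue / quantity) by month (sketch)
implied_price = monthly_summary.assign(
    implied_unit_price=monthly_summary['revenue'] / monthly_summary['quantity']
)[['invoice_year_month', 'implied_unit_price']]
print(implied_price.round(2))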
# creating a line plot of orders number by month
fig, ax1 = plt.subplots(figsize=(10, 5))
plt.title('Invoices and Entries by Month', fontsize=16)

# plotting invoices (orders) data
color_1 = 'navy'
ax1.set_xlabel('Year-Month')
ax1.set_ylabel('Invoices', color=color_1)

sns.lineplot(data=monthly_summary,
             x='invoice_year_month',
             y='unique_invoices',
             marker='o',
             linewidth=2.5,
             markersize=9,
             color=color_1,
             ax=ax1)

ax1.tick_params(axis='x', rotation=45)
ax1.tick_params(axis='y', labelcolor=color_1)

# plotting entries (purchases) data
color_2 = 'skyblue'
ax2 = ax1.twinx()
ax2.set_ylabel('Entries', color=color_2)

sns.lineplot(data=monthly_summary,
             x='invoice_year_month',
             y='entries',
             marker='o',
             linewidth=2.5,
             markersize=9,
             color=color_2,
             ax=ax2)

ax2.tick_params(axis='y', labelcolor=color_2)

# using engineering notation instead of scientific
ax1.yaxis.set_major_formatter(EngFormatter()); ax2.yaxis.set_major_formatter(EngFormatter())
Observations
# creating a combined line plot of unique products and unique customers
fig, ax1 = plt.subplots(figsize=(10, 5))
plt.title('Unique Products and Unique Customers by Month', fontsize=16)

# plotting unique products data
color_1 = 'purple'
ax1.set_xlabel('Year-Month')
ax1.set_ylabel('Unique Products', color=color_1)

sns.lineplot(data=monthly_summary,
             x='invoice_year_month',
             y='unique_products',
             marker='o',
             linewidth=2.5,
             markersize=8,
             color=color_1,
             ax=ax1)

ax1.tick_params(axis='x', rotation=45)
ax1.tick_params(axis='y', labelcolor=color_1)

# plotting unique customers data
color_2 = 'darkgreen'
ax2 = ax1.twinx()
ax2.set_ylabel('Unique Customers', color=color_2)

sns.lineplot(data=monthly_summary,
             x='invoice_year_month',
             y='unique_customers',
             marker='o',
             linewidth=2.5,
             markersize=8,
             color=color_2,
             ax=ax2)

ax2.tick_params(axis='y', labelcolor=color_2)

# using engineering notation instead of scientific
ax1.yaxis.set_major_formatter(EngFormatter()); ax2.yaxis.set_major_formatter(EngFormatter())
Observations
The dynamics of the chart are quite similar to those of revenue and quantity by month (a strong upward trend, most growth occurs between August and November 2019), but with sharper distinctions in May and July 2019.
There is about a 12% decrease in the diversity of products from December 2018 to February 2019. This can at least partially explain the discrepancies we observed earlier on the plot of revenue and quantity by month during the same period.
Only in the last quarter of the dataset did the product range reach and then exceed its original level.
We can see overall significant fluctuations in the monthly number of products and unique customers.
Except for two periods (December 2018 - January 2019 and June - July 2019), we observe a clear, strong correlation between the number of unique customers and unique products sold. This is also perfectly aligned with growth in quantity sold and revenue - graphs of unique products and unique customers show very similar dynamics.
💡 Therefore, we can conclude that both volume and revenue growth were driven by simultaneous growth in product range and customer base.
This phenomenon perfectly aligns with the long tail theory, which states that a broader product range attracts diverse customers and can drive growth. This approach can work either as an alternative to or in conjunction with focusing on major products (as suggested by the Pareto principle).
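To make the Pareto side of this comparison concrete, we could check what share of products accounts for 80% of revenue; a minimal sketch, assuming df_ecom_filtered_12m with the stock_code_description and revenue columns used above:
# share of products needed to reach 80% of total revenue - a Pareto-style check (sketch)
product_revenue = (df_ecom_filtered_12m.groupby('stock_code_description')['revenue']
                   .sum().sort_values(ascending=False))
cumulative_share = product_revenue.cumsum() / product_revenue.sum()
products_for_80 = (cumulative_share <= 0.8).sum()
print(f'{products_for_80 / len(product_revenue):.1%} of products generate 80% of total revenue')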
# creating line plots of mean and median unit prices by month
fig, ax = plt.subplots(figsize=(10, 5))

sns.lineplot(data=monthly_summary, x='invoice_year_month', y='unit_price_mean', marker='d', markersize=8, label='Mean', color='darkgoldenrod', linewidth=2.5)
sns.lineplot(data=monthly_summary, x='invoice_year_month', y='unit_price_median', marker='d', markersize=8, label='Median', color='darkorange', linewidth=2.5)

ax.set_title('Unit Price Mean & Median by Month', fontsize=16)
ax.set_xlabel('Year-Month')
ax.set_ylabel('Unit Price')
plt.xticks(rotation=45);
Observations
Looking at the line plots, there’s a steady gap between mean and median prices, with mean consistently higher. We’ve seen this right-skewed distribution before, and now the data confirms this gap was present and fairly constant each month.
Both metrics, especially the mean, show a clear overall downward trend in prices. The mean price dropped from about 3.75 to around 3.10 (roughly a 17% decrease), while the median fell from about 2.50 to 2.10 (also roughly a 17% decrease); a rough check follows below.
The early months (December 2018 to February 2019) demonstrate notable price volatility in both mean and median. After March 2019, mean prices showed reduced volatility, while median found stability around 2.1. By July 2019, both metrics had stabilized - mean at about 3.10 and median at 2.10.
When comparing revenue, quantity, and mean unit price trends, we notice that unit price peaks often don’t align with revenue peaks. For instance, February 2019 saw a significant peak in mean unit price compared to January, while revenue actually declined.
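The quoted declines can be approximated with a rough endpoint comparison of the first and last months in monthly_summary (a sketch, not a trend fit):
# endpoint-to-endpoint change in mean and median unit price (sketch)
for col in ['unit_price_mean', 'unit_price_median']:
    first, last = monthly_summary[col].iloc[0], monthly_summary[col].iloc[-1]
    print(f'{col}: {first:.2f} -> {last:.2f} ({(last - first) / first:.1%})')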
Let’s create a DataFrame presenting a summary by month and invoice.
monthly_invoices = (
    df_ecom_filtered_12m.groupby(['invoice_year_month', 'invoice_no'])
    .agg({'quantity': ['sum', 'mean', 'median'],
          'revenue': ['sum', 'mean', 'median'],
          'unit_price': ['mean', 'median']})
    .reset_index())

monthly_invoices.columns = ['invoice_year_month',
                            'invoice_no',
                            'quantity', 'quantity_mean', 'quantity_median',
                            'revenue', 'revenue_mean', 'revenue_median',
                            'unit_price_mean', 'unit_price_median']
monthly_invoices.head(10)
invoice_year_month | invoice_no | quantity | quantity_mean | quantity_median | revenue | revenue_mean | revenue_median | unit_price_mean | unit_price_median | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-12 | 536847 | 222 | 24.67 | 24.00 | 215.58 | 23.95 | 20.16 | 1.21 | 1.25 |
1 | 2018-12 | 536848 | 280 | 93.33 | 100.00 | 534.00 | 178.00 | 165.00 | 1.95 | 1.65 |
2 | 2018-12 | 536849 | 106 | 35.33 | 39.00 | 397.50 | 132.50 | 146.25 | 3.75 | 3.75 |
3 | 2018-12 | 536851 | 360 | 24.00 | 12.00 | 1368.40 | 91.23 | 78.00 | 7.20 | 4.25 |
4 | 2018-12 | 536852 | 106 | 17.67 | 18.00 | 71.14 | 11.86 | 10.08 | 0.80 | 0.64 |
5 | 2018-12 | 536856 | 343 | 8.79 | 6.00 | 754.87 | 19.36 | 17.70 | 3.73 | 2.95 |
6 | 2018-12 | 536857 | 54 | 3.18 | 3.00 | 128.03 | 7.53 | 5.90 | 3.66 | 1.49 |
7 | 2018-12 | 536858 | 108 | 21.60 | 24.00 | 223.40 | 44.68 | 39.60 | 3.09 | 1.65 |
8 | 2018-12 | 536859 | 186 | 7.75 | 3.50 | 294.25 | 12.26 | 9.20 | 2.74 | 2.33 |
9 | 2018-12 | 536860 | 108 | 10.80 | 3.00 | 254.40 | 25.44 | 11.68 | 3.95 | 3.20 |
boxplots(monthly_invoices, x_parameter='invoice_year_month', y_parameter='quantity', title_extension='per invoice', color='teal', order=False, show_outliers=False, figsize=(10, 5))
Let’s take a closer look at the dynamics of monthly mean and median values of the main metrics by creating a line plot.
monthly_invoices_summary = (
    monthly_invoices.groupby(['invoice_year_month'])
    .agg({'quantity': ['mean', 'median'],
          'revenue': ['mean', 'median']})
    .reset_index())

monthly_invoices_summary.columns = ['invoice_year_month', 'quantity_mean', 'quantity_median', 'revenue_mean', 'revenue_median']
monthly_invoices_summary
invoice_year_month | quantity_mean | quantity_median | revenue_mean | revenue_median | |
---|---|---|---|---|---|
0 | 2018-12 | 233.59 | 111.50 | 523.15 | 256.13 |
1 | 2019-01 | 280.52 | 146.00 | 532.69 | 303.80 |
2 | 2019-02 | 259.44 | 140.00 | 468.91 | 303.58 |
3 | 2019-03 | 264.99 | 140.00 | 476.01 | 291.44 |
4 | 2019-04 | 248.53 | 142.00 | 421.95 | 302.40 |
5 | 2019-05 | 238.75 | 141.00 | 450.08 | 303.50 |
6 | 2019-06 | 250.75 | 141.00 | 446.50 | 278.02 |
7 | 2019-07 | 281.79 | 163.00 | 485.70 | 302.18 |
8 | 2019-08 | 313.00 | 180.50 | 537.10 | 305.98 |
9 | 2019-09 | 311.39 | 193.00 | 564.89 | 324.14 |
10 | 2019-10 | 307.97 | 177.00 | 546.92 | 312.82 |
11 | 2019-11 | 267.60 | 156.00 | 524.54 | 295.14 |
# creating line plots of mean and median quantity per invoice by month
fig, ax = plt.subplots(figsize=(10, 5))

sns.lineplot(data=monthly_invoices_summary, x='invoice_year_month', y='quantity_mean', marker='d', markersize=8, label='Mean', color='darkseagreen', linewidth=2.5)
sns.lineplot(data=monthly_invoices_summary, x='invoice_year_month', y='quantity_median', marker='d', markersize=8, label='Median', color='teal', linewidth=2.5)

ax.set_title('Quantity per Invoice Mean & Median by Month', fontsize=16)
ax.set_xlabel('Year-Month')
ax.set_ylabel('Quantity')
plt.xticks(rotation=45);
Observations
According to the boxplot analysis, the distribution of the quantity of units per invoice remains quite consistent across the months, with the interquartile range (IQR) staying within a similar band. Only December 2018 stands slightly apart.
The full ranges of values (between the whiskers, covering all data except outliers) show notable variation. For example, the range is widest in September 2019, exceeding the narrowest one, in December 2018, by approximately 50%.
According to the line plot analysis, the median quantity of units per invoice fluctuates but remains relatively stable around 140-150 for about half of the observed time range. However, notable variations occur:
💡 An interesting observation is the decrease in quantity of items per invoice in October-November 2019, both in terms of range and median values. This is particularly notable since total revenue and quantity were growing explosively during the same period.
💡💡 Once again, we must note that diversity of products strongly impacts sales in terms of both revenues and quantity. Recalling our earlier observation that monthly total orders, unique products, and unique customers were also growing significantly in this period, we arrive at one of the major discoveries of the project so far: In the final period of the dataset (September - November 2019) the expanding range of stock codes emerges as a key driver of growth in unique customers, revenues, and quantity sold. However, we cannot conclude this is the only factor, as we lack information on other potential influences, such as marketing campaigns.
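The strength of these relationships can be quantified with a simple correlation matrix over the monthly totals; a minimal sketch, assuming monthly_summary as built above (correlation alone does not prove causation, so this only supports, rather than confirms, the driver hypothesis):
# pairwise correlations between monthly product range, customer base, quantity and revenue (sketch)
print(monthly_summary[['unique_products', 'unique_customers', 'quantity', 'revenue']].corr().round(2))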
We will study both distributions and medians at this stage.
boxplots(monthly_invoices, x_parameter='invoice_year_month', y_parameter='revenue', title_extension='per invoice', color='darkred', order=False, show_outliers=False, figsize=(10, 5))
# creating line plots of mean and median revenue per invoice by month
fig, ax = plt.subplots(figsize=(10, 5))

sns.lineplot(data=monthly_invoices_summary, x='invoice_year_month', y='revenue_mean', marker='d', markersize=8, label='Mean', color='crimson', linewidth=2.5)
sns.lineplot(data=monthly_invoices_summary, x='invoice_year_month', y='revenue_median', marker='d', markersize=8, label='Median', color='darkred', linewidth=2.5)

ax.set_title('Revenue per Invoice Mean & Median by Month', fontsize=16)
ax.set_xlabel('Year-Month')
ax.set_ylabel('Revenue')
plt.xticks(rotation=45);
Observations
According to the boxplot analysis, the distribution of revenue per invoice stays relatively consistent across most months, with differences generally within 20%. However, certain months, especially September and November 2019, show a broader range, indicating some unusually high-revenue invoices. Conversely, December 2018, April, June, and November 2019 show narrower revenue distributions.
According to the line plot analysis, median invoice revenue follows a similar pattern to median invoice quantity, though it experiences two notable dips in March and June 2019, of around 6% and 10%, respectively. There is also a decline in median invoice revenue in October and November, mirroring the decrease seen in median invoice quantity. We see a similar picture when comparing the dynamics of mean invoice revenue with mean invoice quantity, except for April 2019, when mean revenue dropped without a comparable drop in quantity per invoice; this can be explained by the drop in unit prices that month, which we saw above.
The significant gap between mean and median values (ranging from ~150 to ~250) indicates a positively skewed distribution with some high-value invoices (see the sketch below). The relative stability of the median compared to the more volatile mean suggests that while most customers maintained consistent purchasing behavior, the business experienced fluctuating large orders that substantially impacted overall revenue.
It’s important to highlight that dips in mean and median invoice revenue are not directly aligned with dips in overall revenue. For example, in March 2019 the overall revenue was at a local peak, while median invoice revenue was slightly decreasing and mean invoice revenue was almost stable.
Additionally, at the beginning of the dataset (December 2018 to February 2019), we see median invoice revenue and the median number of units per invoice rising rapidly (by about 20-25%). Meanwhile, total revenue declines, forming a line similar to the monthly number of invoices over the same period (both declining by about 20-25%). At the same time, we see a rapid decrease in the number of unique products (about 15%). This can be explained by a limited product assortment (number of unique products) and a relatively low overall level of orders during the same period.
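The size of that mean-median gap can be tracked month by month; a minimal sketch, assuming monthly_invoices_summary as built above:
# monthly gap between mean and median invoice revenue (sketch)
gap = monthly_invoices_summary.assign(
    mean_median_gap=monthly_invoices_summary['revenue_mean'] - monthly_invoices_summary['revenue_median']
)[['invoice_year_month', 'mean_median_gap']]
print(gap.round(2))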
We will create a DataFrame presenting the daily number of orders, revenue, quantity, and number of unique customers. We will also consider grouping by month and week, as it may be useful later on.
daily_summary_12m = df_ecom_filtered_12m.groupby(['invoice_day', 'invoice_day_name', 'invoice_day_of_week']).agg({
    'stock_code_description': 'count',
    'invoice_no': 'nunique',
    'revenue': 'sum',
    'quantity': 'sum',
    'customer_id': 'nunique'
    }).reset_index().sort_values('invoice_day')

daily_summary_12m.columns = ['invoice_day', 'invoice_day_name', 'invoice_day_of_week', 'entries', 'unique_invoices', 'revenue', 'quantity', 'unique_customers']
daily_summary_12m
invoice_day | invoice_day_name | invoice_day_of_week | entries | unique_invoices | revenue | quantity | unique_customers | |
---|---|---|---|---|---|---|---|---|
0 | 2018-12-01 | Saturday | 5 | 2123 | 68 | 44788.90 | 16136 | 51 |
1 | 2018-12-03 | Monday | 0 | 2591 | 88 | 30908.67 | 16163 | 76 |
2 | 2018-12-04 | Tuesday | 1 | 3757 | 102 | 51667.12 | 21592 | 83 |
3 | 2018-12-05 | Wednesday | 2 | 2835 | 82 | 81454.99 | 25160 | 66 |
4 | 2018-12-06 | Thursday | 3 | 2519 | 116 | 44153.98 | 22990 | 100 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
292 | 2019-11-26 | Tuesday | 1 | 3241 | 114 | 54429.43 | 29311 | 97 |
293 | 2019-11-27 | Wednesday | 2 | 4204 | 135 | 68098.41 | 30782 | 110 |
294 | 2019-11-28 | Thursday | 3 | 3325 | 107 | 56088.10 | 28324 | 92 |
295 | 2019-11-29 | Friday | 4 | 2782 | 121 | 50605.15 | 26979 | 112 |
296 | 2019-11-30 | Saturday | 5 | 2777 | 120 | 55917.17 | 28671 | 96 |
297 rows × 8 columns
Now we will plot the totals and the corresponding distributions side by side for each parameter we study.
# plotting totals and relevant distributions for each parameter by day of week
for parameter in ['entries', 'unique_invoices', 'revenue', 'quantity', 'unique_customers']:
    plot_totals_distribution(daily_summary_12m, 'invoice_day_name', parameter, fig_height=400, fig_width=900,
                             show_outliers=False, title_start=False, plot_totals=True, plot_distribution=True, consistent_colors=True)
Let’s make an extra check of the total number of invoices by day of the week. We will check the original dataset to be sure nothing was missed when cleaning the data. We will count unique invoices (the invoice column originally had no missing values).
daily_invoices_df = df_ecom.groupby(['invoice_day_of_week', 'invoice_day_name'])['invoice_no'].nunique().reset_index()
daily_invoices_df = daily_invoices_df.rename(columns={'invoice_no': 'unique_invoices'})
daily_invoices_df
invoice_day_of_week | invoice_day_name | unique_invoices | |
---|---|---|---|
0 | 0 | Monday | 2381 |
1 | 1 | Tuesday | 3960 |
2 | 2 | Wednesday | 4430 |
3 | 3 | Thursday | 4496 |
4 | 4 | Friday | 5353 |
5 | 5 | Saturday | 3824 |
# getting full list of day names, as we want to display the missing week day on the pie-chart in the next step
all_days = list(calendar.day_name)
all_days_df = pd.DataFrame({'invoice_day_name': all_days})

# merging DataFrames to add the missing day into the original DataFrame
daily_invoices_df = pd.merge(all_days_df, daily_invoices_df, on=['invoice_day_name'], how='left').fillna(0)
daily_invoices_df
invoice_day_name | invoice_day_of_week | unique_invoices | |
---|---|---|---|
0 | Monday | 0.00 | 2381.00 |
1 | Tuesday | 1.00 | 3960.00 |
2 | Wednesday | 2.00 | 4430.00 |
3 | Thursday | 3.00 | 4496.00 |
4 | Friday | 4.00 | 5353.00 |
5 | Saturday | 5.00 | 3824.00 |
6 | Sunday | 0.00 | 0.00 |
# creating a palette with red for zero values
base_colors = sns.color_palette('pastel', len(all_days))
colors = ['red' if invoice_no == 0 else color
          for invoice_no, color in zip(daily_invoices_df['unique_invoices'], base_colors)]  # pairing each 'invoice_no' value with a corresponding color from the base palette

# calculating percentages
total_invoices = daily_invoices_df['unique_invoices'].sum()
percentages = daily_invoices_df['unique_invoices'] / total_invoices * 100

# creating a pie chart
plt.figure(figsize=(6, 6))
wedges, texts, autotexts = plt.pie(
    percentages,
    labels=all_days,
    autopct=lambda pct: f'{pct:.1f}%' if pct > 0 else '0.0%',  # manually setting autopct (percentages in this case), otherwise the zero value won't be displayed
    startangle=90,
    pctdistance=0.85,
    colors=colors)

# setting red label and percentage for the zero-value case
for i, (text, autotext) in enumerate(zip(texts, autotexts)):
    if percentages[i] == 0:
        text.set_color('red')
        autotext.set_color('red')

plt.title('Distribution of Invoices by Day of Week (in the Original Dataset)', fontsize=14)
plt.annotate('Note: Percentages represent the proportion of invoices for each day.', xy=(0, -1.25), fontsize=10, style='italic', ha='center')
#plt.tight_layout()
plt.show();
Observations
Friday is the most efficient weekday in terms of quantity and revenue generation. It is also the leader in the daily number of orders and customers, and second (after Wednesday) in the daily number of purchases. Interestingly, Friday displays the highest median values across all parameters studied (entries, invoices, revenue, quantity, and unique customers). Notably, 22% of all purchases occur on Fridays (in the original, uncleaned dataset).
In contrast, Monday is the least efficient weekday, showing the lowest totals and median values for the same parameters. Monday stands apart from the other weekdays with a significant gap. For instance, Monday’s revenue performance is approximately three times lower than Friday’s (774k vs 2.0M in totals and 12.2 vs 35.7 in daily median values).
Thursday and Wednesday follow as the next most efficient days in terms of quantity and revenue. Wednesdays typically generate slightly more purchases and revenue, while Thursdays show better results in the number of orders and unique customers. Interestingly, Wednesdays slightly outperform Fridays in the total number of purchases (while the median daily number of purchases on Fridays is slightly higher, which suggests the impact of several very strong Wednesdays).
Saturday and Tuesday are very close to each other and rank lower across almost all parameters.
The ranges and interquartile ranges (IQRs) vary significantly from day to day and from parameter to parameter. Notably, Friday demonstrates the widest ranges and IQRs for almost all parameters, except for the number of orders and purchases, where it shares the lead with Thursday and Wednesday.
We observe no entries recorded on Sundays, which is unusual for an e-commerce business. To ensure the reliability of our conclusions, we verified this by checking the original unfiltered dataset.
It’s noteworthy that Saturday is not among the high-performing days, which one might expect for a day off.
To ensure we haven’t missed any weeks, we will also examine the distribution of invoices by week. Given the higher number of data points compared to our monthly invoice analysis, we will utilize the Plotly visualization library. This will provide a more interactive and detailed view of our data.
# checking distribution of invoices by week
weekly_invoices = df_ecom.groupby(['invoice_year_week'])['invoice_no'].nunique().reset_index().rename(columns={'invoice_no': 'unique_invoices'})
weekly_invoices.head()
invoice_year_week | unique_invoices | |
---|---|---|
0 | 2018-Week-48 | 376 |
1 | 2018-Week-49 | 690 |
2 | 2018-Week-50 | 595 |
3 | 2018-Week-51 | 239 |
4 | 2019-Week-01 | 252 |
# plotting a line plot of the distribution of invoices by week
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=weekly_invoices['invoice_year_week'],
    y=weekly_invoices['unique_invoices'],
    mode='lines+markers',
    line_color='navy',
    name='Weekly Invoices'))

fig.update_layout(
    title={'text': 'Invoices by Week', 'font_size': 20, 'y': 0.9, 'x': 0.5},
    xaxis_title='Week',
    yaxis_title='Invoices',
    width=900,
    height=600,
    xaxis=dict(tickangle=-45))

# adding markers highlighting peaks of orders
peak_weeks = ['2018-Week-49', '2019-Week-46']
peak_data = weekly_invoices[weekly_invoices['invoice_year_week'].isin(peak_weeks)]

fig.add_trace(go.Scatter(
    x=peak_data['invoice_year_week'],
    y=peak_data['unique_invoices'],
    mode='markers',
    marker=dict(color='green', size=100, symbol='circle-open',
                line=dict(color='green', width=1)),
    name='Peak Weeks'))

for week in peak_weeks:
    fig.add_vline(x=week, line_color='green', line_width=1, line_dash='dash')

fig.show();
Observations
The distribution of invoices by week is consistent, with no gaps such as missing weeks. Despite some local fluctuations, there is an overall positive growth trend in the number of invoices over time.
💡 We observe two major peaks: one in week 49 of 2018 (more than double the number of orders compared to the previous week 48 - 690 vs 376 invoices), and the second a year later in weeks 45-48 of 2019, with the highest point in week 46 (851 invoices).
These time periods are very likely connected with Black Friday sales events (which typically occur in late November and may extend to a longer promotional period). The broader peak in 2019 was likely due to an extended sales period, potentially including Cyber Monday promotions as well.
💡 This pattern demonstrates either the exceptional effectiveness of marketing campaigns during these major seasonal sales, the tendency of business customers (which we already studied at the EDA stage) to take advantage of discounts and buy more on these days, or a combination of both.
Above, we studied parameters on different scales, with different amplitudes and axes that do not start at zero, so interpreting the graphs separately may be misleading when comparing their dynamics.
Now we will study both absolute and relative changes of the main parameters and visualize these changes on the same graphs. We will again use the Plotly visualization library to provide a more interactive and detailed view of our data. We will build two plots: the first will show absolute changes - how much each parameter has changed compared to its starting value; the second will show relative (month-over-month) changes, providing a clear overview of the periods of growth and decline for each parameter.
Note: Here we decided to plot mean (not median) values of unit price and invoice quantity and revenue for better tracking of overall trends, even with skewed data.
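To make the two measures explicit before applying them to our monthly summaries, here is a minimal sketch on a toy series (not our data): pct_change() gives the month-over-month change we refer to as “relative”, while comparing each value to the first month’s value gives the change we refer to as “absolute”.
# a minimal sketch on a toy series (not our data) of the two change measures used below
import pandas as pd

toy = pd.Series([100, 120, 90], index=['2019-01', '2019-02', '2019-03'])

mom_change_pct = toy.pct_change() * 100                        # month-over-month: 2019-02 -> +20.0%, 2019-03 -> -25.0%
vs_first_month_pct = (toy - toy.iloc[0]) / toy.iloc[0] * 100   # vs the first month: 2019-02 -> +20.0%, 2019-03 -> -10.0%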
# calculating monthly change percentage for the total values and adding new columns
parameters = ['revenue', 'quantity', 'unique_invoices', 'unique_products', 'unique_customers', 'unit_price_mean']

for parameter in parameters:
    monthly_summary[f'{parameter}_change_pct'] = monthly_summary[parameter].pct_change() * 100

# calculating changes relative to the first month and adding new columns
first_month_values = {parameter: monthly_summary[parameter].iloc[0] for parameter in parameters}

for parameter in parameters:
    monthly_summary[f'{parameter}_absolute_change_pct'] = ((monthly_summary[parameter] - first_month_values[parameter]) / first_month_values[parameter]) * 100

monthly_summary
invoice_year_month | revenue | quantity | unique_invoices | entries | unique_products | unique_customers | unit_price_mean | unit_price_median | revenue_change_pct | quantity_change_pct | unique_invoices_change_pct | unique_products_change_pct | unique_customers_change_pct | unit_price_mean_change_pct | revenue_absolute_change_pct | quantity_absolute_change_pct | unique_invoices_absolute_change_pct | unique_products_absolute_change_pct | unique_customers_absolute_change_pct | unit_price_mean_absolute_change_pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-12 | 670676.20 | 299461 | 1282 | 35788 | 2736 | 769 | 3.86 | 2.55 | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 2019-01 | 641890.68 | 338021 | 1205 | 36781 | 2602 | 806 | 3.35 | 2.10 | -4.29 | 12.88 | -6.01 | -4.90 | 4.81 | -13.39 | -4.29 | 12.88 | -6.01 | -4.90 | 4.81 | -13.39 |
2 | 2019-02 | 502201.30 | 277862 | 1071 | 26089 | 2396 | 745 | 3.56 | 2.46 | -21.76 | -17.80 | -11.12 | -7.92 | -7.57 | 6.53 | -25.12 | -7.21 | -16.46 | -12.43 | -3.12 | -7.74 |
3 | 2019-03 | 671649.94 | 373897 | 1411 | 34278 | 2495 | 950 | 3.45 | 2.10 | 33.74 | 34.56 | 31.75 | 4.13 | 27.52 | -3.30 | 0.15 | 24.86 | 10.06 | -8.81 | 23.54 | -10.78 |
4 | 2019-04 | 497476.19 | 293019 | 1179 | 27993 | 2440 | 826 | 3.32 | 2.08 | -25.93 | -21.63 | -16.44 | -2.20 | -13.05 | -3.72 | -25.82 | -2.15 | -8.03 | -10.82 | 7.41 | -14.10 |
5 | 2019-05 | 784946.06 | 416382 | 1744 | 38227 | 2516 | 1080 | 3.49 | 2.10 | 57.79 | 42.10 | 47.92 | 3.11 | 30.75 | 5.07 | 17.04 | 39.04 | 36.04 | -8.04 | 40.44 | -9.75 |
6 | 2019-06 | 659034.58 | 370107 | 1476 | 33526 | 2580 | 972 | 3.29 | 2.08 | -16.04 | -11.11 | -15.37 | 2.54 | -10.00 | -5.60 | -1.74 | 23.59 | 15.13 | -5.70 | 26.40 | -14.80 |
7 | 2019-07 | 722230.94 | 419026 | 1487 | 39748 | 2692 | 970 | 3.06 | 1.95 | 9.59 | 13.22 | 0.75 | 4.34 | -0.21 | -7.07 | 7.69 | 39.93 | 15.99 | -1.61 | 26.14 | -20.83 |
8 | 2019-08 | 754086.87 | 439459 | 1404 | 35297 | 2589 | 940 | 3.14 | 2.08 | 4.41 | 4.88 | -5.58 | -3.83 | -3.09 | 2.61 | 12.44 | 46.75 | 9.52 | -5.37 | 22.24 | -18.77 |
9 | 2019-09 | 963129.03 | 530912 | 1705 | 46410 | 2717 | 1215 | 3.06 | 2.08 | 27.72 | 20.81 | 21.44 | 4.94 | 29.26 | -2.61 | 43.61 | 77.29 | 33.00 | -0.69 | 58.00 | -20.89 |
10 | 2019-10 | 1165477.67 | 656282 | 2131 | 61167 | 2861 | 1431 | 3.10 | 2.08 | 21.01 | 23.61 | 24.99 | 5.30 | 17.78 | 1.49 | 73.78 | 119.15 | 66.22 | 4.57 | 86.09 | -19.71 |
11 | 2019-11 | 1484959.99 | 757586 | 2831 | 83027 | 2931 | 1673 | 3.10 | 2.08 | 27.41 | 15.44 | 32.85 | 2.45 | 16.91 | -0.12 | 121.41 | 152.98 | 120.83 | 7.13 | 117.56 | -19.80 |
# calculating monthly change percentage for the invoices mean and median values and adding new columns
m_parameters = ['quantity_mean', 'revenue_mean']  #, 'unit_price_median']

for m_parameter in m_parameters:
    monthly_invoices_summary[f'{m_parameter}_change_pct'] = monthly_invoices_summary[m_parameter].pct_change() * 100

# calculating changes relative to the first month and adding new columns
m_first_month_values = {m_parameter: monthly_invoices_summary[m_parameter].iloc[0] for m_parameter in m_parameters}

for m_parameter in m_parameters:
    monthly_invoices_summary[f'{m_parameter}_absolute_change_pct'] = ((monthly_invoices_summary[m_parameter] - m_first_month_values[m_parameter]) / m_first_month_values[m_parameter]) * 100

monthly_invoices_summary
invoice_year_month | quantity_mean | quantity_median | revenue_mean | revenue_median | quantity_mean_change_pct | revenue_mean_change_pct | quantity_mean_absolute_change_pct | revenue_mean_absolute_change_pct | |
---|---|---|---|---|---|---|---|---|---|
0 | 2018-12 | 233.59 | 111.50 | 523.15 | 256.13 | NaN | NaN | 0.00 | 0.00 |
1 | 2019-01 | 280.52 | 146.00 | 532.69 | 303.80 | 20.09 | 1.82 | 20.09 | 1.82 |
2 | 2019-02 | 259.44 | 140.00 | 468.91 | 303.58 | -7.51 | -11.97 | 11.07 | -10.37 |
3 | 2019-03 | 264.99 | 140.00 | 476.01 | 291.44 | 2.14 | 1.51 | 13.44 | -9.01 |
4 | 2019-04 | 248.53 | 142.00 | 421.95 | 302.40 | -6.21 | -11.36 | 6.40 | -19.34 |
5 | 2019-05 | 238.75 | 141.00 | 450.08 | 303.50 | -3.94 | 6.67 | 2.21 | -13.97 |
6 | 2019-06 | 250.75 | 141.00 | 446.50 | 278.02 | 5.03 | -0.80 | 7.35 | -14.65 |
7 | 2019-07 | 281.79 | 163.00 | 485.70 | 302.18 | 12.38 | 8.78 | 20.64 | -7.16 |
8 | 2019-08 | 313.00 | 180.50 | 537.10 | 305.98 | 11.08 | 10.58 | 34.00 | 2.67 |
9 | 2019-09 | 311.39 | 193.00 | 564.89 | 324.14 | -0.52 | 5.17 | 33.30 | 7.98 |
10 | 2019-10 | 307.97 | 177.00 | 546.92 | 312.82 | -1.10 | -3.18 | 31.84 | 4.54 |
11 | 2019-11 | 267.60 | 156.00 | 524.54 | 295.14 | -13.11 | -4.09 | 14.56 | 0.27 |
# creating line plots - for each parameter's absolute change
# defining the colors
colors = {
    'revenue': 'darkred',
    'quantity': 'teal',
    'unique_invoices': 'navy',
    'unique_products': 'purple',
    'unique_customers': 'darkgreen',
    'unit_price_mean': 'darkgoldenrod',
    'unit_price_median': 'darkorange',
    'revenue_mean': 'crimson',
    'revenue_median': 'darkred',
    'quantity_mean': 'darkseagreen',
    'quantity_median': 'teal'}

fig = go.Figure()

# adding traces
for parameter in parameters:
    color = colors.get(parameter, 'gray')  # default to gray if parameter not in colors dict

    fig.add_trace(go.Scatter(
        x=monthly_summary['invoice_year_month'],
        y=monthly_summary[f'{parameter}_absolute_change_pct'],
        mode='lines+markers',
        name=f'{parameter}',
        marker=dict(size=8, color=color),
        line=dict(width=2, color=color),
        hovertemplate='<b>%{x}</b><br>' +
                      f'Parameter: {parameter} Absolute Change<br>' +
                      'Value: %{y:.2f}%<extra></extra>'))  # hiding secondary box in hover labels

for m_parameter in m_parameters:
    color = colors.get(m_parameter, 'gray')  # default to gray if parameter not in colors dict

    fig.add_trace(go.Scatter(
        x=monthly_invoices_summary['invoice_year_month'],
        y=monthly_invoices_summary[f'{m_parameter}_absolute_change_pct'],
        mode='lines+markers',
        name=f'invoice_{m_parameter}',
        marker=dict(size=8, symbol='diamond', color=color),
        line=dict(width=2, dash='dot', color=color),
        hovertemplate='<b>%{x}</b><br>' +
                      f'Parameter: invoice_{m_parameter} Absolute Change<br>' +
                      'Value: %{y:.2f}%<extra></extra>'))  # hiding secondary box in hover labels

# adding annotations for the milestones
milestone_number = 0
for milestone in ['2019-02', '2019-08']:
    milestone_number += 1
    milestone_title = f'Milestone {milestone_number}'
    milestone_date = datetime.strptime(milestone, '%Y-%m') - timedelta(days=5)

    fig.add_annotation(
        text=milestone_title,
        yref='y',
        x=milestone_date, y=140, textangle=-90,
        showarrow=False,
        font=dict(size=14, color='gray'))

fig.update_layout(
    title={'text': 'Absolute Changes in Parameters by Month', 'font_size': 20, 'y': 0.92, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    xaxis_title='Month',
    yaxis_title='Absolute Change (%)',
    xaxis_tickangle=-45,
    yaxis=dict(showgrid=True),
    showlegend=True,
    # legend={'y': 0.97, 'x': 0.03},
    width=900,
    height=700)

fig.add_hline(y=0, line_color='darkgray', line_width=2, line_dash='solid')
for milestone in ['2019-02', '2019-08']:
    fig.add_vline(x=milestone, line_color='darkgray', line_width=2, line_dash='dash')

fig.show();
# creating line plots - for each parameter's relative changes
# defining the colors
colors = {
    'revenue': 'darkred',
    'quantity': 'teal',
    'unique_invoices': 'navy',
    'unique_products': 'purple',
    'unique_customers': 'darkgreen',
    'unit_price_mean': 'darkgoldenrod',
    'unit_price_median': 'darkorange',
    'revenue_mean': 'crimson',
    'revenue_median': 'darkred',
    'quantity_mean': 'darkseagreen',
    'quantity_median': 'teal'}

fig = go.Figure()

# adding colored background regions
fig.add_hrect(
    y0=0, y1=70,
    fillcolor='rgba(209, 254, 184, 0.2)',  # light green for growth period (change % above 0)
    layer='below',
    line_width=0)

fig.add_hrect(
    y0=-40, y1=0,
    fillcolor='rgba(255, 209, 220, 0.2)',  # light red for decline period (change % below 0)
    layer='below',
    line_width=0)

# adding annotations for growth and decline periods
fig.add_annotation(
    text='Growth Period',
    xref='paper', yref='y',
    x=0.5, y=65,
    showarrow=False,
    font=dict(size=14, color='darkgreen'))

fig.add_annotation(
    text='Decline Period',
    xref='paper', yref='y',
    x=0.5, y=-35,
    showarrow=False,
    font=dict(size=14, color='darkred'))

# adding annotations for the milestones
milestone_number = 0
for milestone in ['2019-02', '2019-08']:
    milestone_number += 1
    milestone_title = f'Milestone {milestone_number}'
    milestone_date = datetime.strptime(milestone, '%Y-%m') - timedelta(days=5)

    fig.add_annotation(
        text=milestone_title,
        yref='y',
        x=milestone_date, y=55, textangle=-90,
        showarrow=False,
        font=dict(size=14, color='gray'))

# adding traces
for parameter in parameters:
    color = colors.get(parameter, 'gray')  # default to gray if parameter not in colors dict

    fig.add_trace(go.Scatter(
        x=monthly_summary['invoice_year_month'],
        y=monthly_summary[f'{parameter}_change_pct'],
        mode='lines+markers',
        name=f'{parameter}',
        marker=dict(size=8, color=color),
        line=dict(width=2, color=color),
        hovertemplate='<b>%{x}</b><br>' +
                      f'Parameter: {parameter} Relative Change<br>' +
                      'Value: %{y:.2f}%<extra></extra>'))  # hiding secondary box in hover labels

for m_parameter in m_parameters:
    color = colors.get(m_parameter, 'gray')  # using m_parameter instead of parameter

    fig.add_trace(go.Scatter(
        x=monthly_invoices_summary['invoice_year_month'],
        y=monthly_invoices_summary[f'{m_parameter}_change_pct'],
        mode='lines+markers',
        name=f'invoice_{m_parameter}',
        marker=dict(size=8, color=color, symbol='diamond'),
        line=dict(width=2, color=color, dash='dot'),
        hovertemplate='<b>%{x}</b><br>' +
                      f'Parameter: invoice_{m_parameter} Relative Change<br>' +
                      'Value: %{y:.2f}%<extra></extra>'))  # hiding secondary box in hover labels

# updating appearance
fig.update_layout(
    title={'text': 'Relative Changes in Parameters by Month', 'font_size': 20, 'y': 0.92, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    xaxis_title='Month',
    yaxis_title='Relative Change (%)',
    xaxis_tickangle=-45,
    yaxis=dict(showgrid=True),
    showlegend=True,
    # legend={'y': 0.97, 'x': 0.03},
    width=1000,
    height=700,
    paper_bgcolor='white')

fig.add_hline(y=0, line_color='darkgray', line_width=2, line_dash='solid')
for milestone in ['2019-02', '2019-08']:
    fig.add_vline(x=milestone, line_color='darkgray', line_width=2, line_dash='dash')

fig.show();
Observations
💡 Our analysis reveals three distinct phases during the study period (each with its own characteristics and focus):
💡 Our analysis reveals two significant performance levers: price and products variety.
Unit price generally shows a strong inverse correlation with volume metrics: a minor change in mean unit price goes in parallel with a much larger change in the other metrics. For instance, a ~7% growth of mean unit price aligns with a ~18% decrease in quantity and a ~22% decrease in revenue in February 2019, while a ~7% decrease of mean unit price aligns with a ~12% increase in quantity and a ~10% increase in revenue in July 2019. This indicates high price sensitivity among customers.
Product assortment demonstrates a direct correlation with performance - typically, a 1% increase in unique products drives a 2-10% increase in revenue, quantity sold, and unique customers, with a similar impact in the case of a decrease. For instance, ~4% growth in the number of products goes together with ~28-35% growth in customers, revenue, quantity, and invoices in March 2019, while a ~2% decline in the number of products aligns with a ~13-26% decline in those parameters in April 2019. Except for a few months, the dynamics of the number of products over time are very similar to those of the numbers of customers and invoices, so product variety appears to be a critical driver of both customer acquisition and sales growth.
💡 Overall, the business revised its product range and launched new products, moving from a correction phase with higher prices through an experimentation phase - where it very likely found its core niches and optimal product offerings - to a scaling phase, effectively using price and product assortment as growth levers, likely supported by efficient promotions. As a result, despite temporary drawdowns, within just 12 months the business increased sales volume by ~153%, and revenue, invoices, and the customer base by ~118-121%.
Note: we will perform the correlation analysis in the next step to verify our current conclusions.
# building a correlation matrix and heatmap
corr_matrix_qty_price = df_ecom_filtered[['quantity', 'unit_price']].corr().round(2)
plt.figure(figsize=(8, 6))
plt.title('Correlation Heatmap of Quantity and Unit Price', fontsize=16)

# avoid showing duplicated data on the heatmap by creating a mask that hides the upper triangle
# (np.ones_like() builds an array with the same shape as the correlation matrix, and np.triu() keeps
# its upper triangle, which is then used as a boolean mask)
hide_triangle_mask = np.triu(np.ones_like(corr_matrix_qty_price))

sns.heatmap(corr_matrix_qty_price, annot=True, mask=hide_triangle_mask, cmap='RdYlGn', vmin=-1, vmax=1);

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_ecom_filtered, x='unit_price', y='quantity', alpha=0.5)
plt.title('Scatter Plot of Quantity and Unit Price', fontsize=16);
Observations
We see a very weak negative relationship between quantity and unit price per entry, where the correlation is -0.09.
This suggests only an insignificant tendency toward lower prices in larger purchases.
Let’s add a float representation of invoice_year_month
. This will allow us to include months in our pairplot analysis of the monthly parameters, making it easier to detect seasonality effects.
Note: Alongside the total values of the parameters, we will also analyze the median unit price. We chose the median because it remains stable even in the presence of significant price fluctuations (making it more reliable for correlation analysis) and better reflects typical unit prices, given the skewness of our unit price distribution.
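For instance, with the mapping applied below, '2018-12' becomes 2018 + (12 - 0.1) / 12 ≈ 2018.99 and '2019-02' becomes 2019 + (2 - 0.1) / 12 ≈ 2019.16, so the chronological order of months is preserved as a monotonically increasing float.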
monthly_summary['invoice_year_month_float'] = (
    monthly_summary['invoice_year_month']
    .apply(lambda x: float(x[:4]) + (float(x[-2:]) - 0.1) / 12)
    .round(2))

monthly_summary.head(3)
invoice_year_month | revenue | quantity | unique_invoices | entries | unique_products | unique_customers | unit_price_mean | unit_price_median | revenue_change_pct | quantity_change_pct | unique_invoices_change_pct | unique_products_change_pct | unique_customers_change_pct | unit_price_mean_change_pct | revenue_absolute_change_pct | quantity_absolute_change_pct | unique_invoices_absolute_change_pct | unique_products_absolute_change_pct | unique_customers_absolute_change_pct | unit_price_mean_absolute_change_pct | invoice_year_month_float | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-12 | 670676.20 | 299461 | 1282 | 35788 | 2736 | 769 | 3.86 | 2.55 | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2018.99 |
1 | 2019-01 | 641890.68 | 338021 | 1205 | 36781 | 2602 | 806 | 3.35 | 2.10 | -4.29 | 12.88 | -6.01 | -4.90 | 4.81 | -13.39 | -4.29 | 12.88 | -6.01 | -4.90 | 4.81 | -13.39 | 2019.08 |
2 | 2019-02 | 502201.30 | 277862 | 1071 | 26089 | 2396 | 745 | 3.56 | 2.46 | -21.76 | -17.80 | -11.12 | -7.92 | -7.57 | 6.53 | -25.12 | -7.21 | -16.46 | -12.43 | -3.12 | -7.74 | 2019.16 |
monthly_summary_corr = monthly_summary[['revenue', 'quantity', 'unique_invoices', 'unit_price_median', 'unique_products', 'unique_customers', 'invoice_year_month_float']]
# building a correlation matrix and heatmap
corr_matrix_monthly_summary = monthly_summary_corr.corr().round(2)
plt.figure(figsize=(10, 8))
plt.title('Correlation Heatmap of Parameters Grouped by Month', fontsize=16)

# avoiding showing the duplicated data on the heatmap
hide_triangle_mask = np.triu(np.ones_like(corr_matrix_monthly_summary))

sns.heatmap(corr_matrix_monthly_summary, mask=hide_triangle_mask, annot=True, cmap='RdYlGn', vmin=-1, vmax=1, linewidths=0.7);
# plotting a pairplot
plt.figure(figsize=(18, 18))
fig = sns.pairplot(monthly_summary_corr, diag_kind='kde')
plt.suptitle('Pairplot of Parameters by Month', y=1.02, fontsize=16)

# avoiding scientific notation on axes
for ax in fig.axes.flat:
    ax.xaxis.set_major_formatter(ScalarFormatter(useOffset=False, useMathText=False))
    ax.yaxis.set_major_formatter(ScalarFormatter(useOffset=False, useMathText=False))
    ax.ticklabel_format(style='plain', axis='both')

plt.tight_layout();
Observations
Both the heatmap and pairplot indicate a high degree of linear correlation among factors driving revenue, such as quantity, invoices, unique products, and unique customers.
The temporal variable invoice_year_month_float
significantly influences revenue, quantity, and other metrics, suggesting the impact of seasonality.
An upward trend is observed in most metrics over time, indicating a positive correlation and non-linear growth.
💡 The most valuable insight is the strong influence unique products and unique customers have on growth factors, such as quantity, revenue, and invoice volume, where:
💡 These strong correlations suggest that expanding the product range and the customer base have been the key drivers of business growth (thus confirming our observations from the time-based analysis stage).
The non-linear growth over time may be explained by the non-linear growth of both the product assortment and customer base, along with seasonal factors and marketing campaigns.
The heatmap reveals negative correlations between median unit price and all growth metrics, most notably with quantity (-0.43), unique customers (-0.43), and the month variable (-0.62).
The weak negative correlation (-0.17) between median unit price and unique products suggests that the product range expansion favored lower-priced products.
💡💡 These findings complement the earlier observations that product range and customer base expansion are the key growth drivers, with the general price-reduction trend being a contributing factor to this growth.
Let’s add a float representation of invoice_year_month
. This will allow us to include months in our correlation analysis of invoice-grouped parameters, helping to detect the influence of seasonality.
For better identification of seasonal influences, we will use monthly median values of the parameters grouped by invoices.
monthly_invoices_summary['invoice_year_month_float'] = (
    monthly_invoices_summary['invoice_year_month']
    .apply(lambda x: float(x[:4]) + (float(x[-2:]) - 0.1) / 12)
    .round(2))

monthly_invoices_summary.head(3)
invoice_year_month | quantity_mean | quantity_median | revenue_mean | revenue_median | quantity_mean_change_pct | revenue_mean_change_pct | quantity_mean_absolute_change_pct | revenue_mean_absolute_change_pct | invoice_year_month_float | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-12 | 233.59 | 111.50 | 523.15 | 256.13 | NaN | NaN | 0.00 | 0.00 | 2018.99 |
1 | 2019-01 | 280.52 | 146.00 | 532.69 | 303.80 | 20.09 | 1.82 | 20.09 | 1.82 | 2019.08 |
2 | 2019-02 | 259.44 | 140.00 | 468.91 | 303.58 | -7.51 | -11.97 | 11.07 | -10.37 | 2019.16 |
monthly_invoices_summary_corr = monthly_invoices_summary[['quantity_median', 'revenue_median', 'invoice_year_month_float']]  #'unit_price_median'
# building a correlation matrix and heatmap
corr_matrix_by_invoice_month = monthly_invoices_summary_corr.corr().round(2)

plt.figure(figsize=(10, 8))
plt.title('Correlation Heatmap of Invoice Quantity and Revenue by Month', fontsize=16)

# avoid showing the duplicated data on the heatmap
hide_triangle_mask = np.triu(np.ones_like(corr_matrix_by_invoice_month))

sns.heatmap(corr_matrix_by_invoice_month, mask=hide_triangle_mask, annot=True, cmap='RdYlGn', vmin=-1, vmax=1, linewidths=0.7);
# plotting a pairplot
fig = sns.pairplot(monthly_invoices_summary_corr, diag_kind='kde')
plt.suptitle('Pairplot of Invoice Quantity and Revenue by Month', y=1.02, fontsize=16)

# avoiding scientific notation on axes
for ax in fig.axes.flat:
    ax.xaxis.set_major_formatter(ScalarFormatter(useOffset=False, useMathText=False))
    ax.yaxis.set_major_formatter(ScalarFormatter(useOffset=False, useMathText=False))
    ax.ticklabel_format(style='plain', axis='both')

plt.tight_layout();
Observations
We see a strong, approximately linear relationship between median invoice revenue and median invoice quantity, with correlation of 0.81. This is expected and confirms that revenue generally increases with quantity sold.
The relationships of both median invoice quantity and median invoice revenue with invoice year-month are non-linear, but show an overall positive trend with high fluctuations:
# building a correlation matrix and heatmap
corr_matrix_by_day = daily_summary_12m.drop(['invoice_day', 'invoice_day_name'], axis=1).corr().round(2)

plt.figure(figsize=(10, 8))
plt.title('Correlation Heatmap of Parameters Grouped by Day of Week', fontsize=16)

# avoid showing the duplicated data on the heatmap
hide_triangle_mask = np.triu(np.ones_like(corr_matrix_by_day))

sns.heatmap(corr_matrix_by_day, mask=hide_triangle_mask, annot=True, cmap='RdYlGn', vmin=-1, vmax=1, linewidths=0.7);
# plotting a pairplot
fig = sns.pairplot(daily_summary_12m.drop(['invoice_day', 'invoice_day_name'], axis=1), diag_kind='kde')
plt.suptitle('Pairplot of Parameters Grouped by Day of Week', y=1.02, fontsize=16);
Observations
The data grouped by day of week shows a high degree of linear correlation between invoices, revenue, and quantity (correlations from 0.75 to 0.93), mirroring the patterns observed in our previous analyses.
The day of the week influences the key parameters:
💡 These insights numerically confirm our previous assumptions that certain days demonstrate a larger number of orders and slightly more unique customers. From our previous time-based analysis, we know that these high-performing days are mostly Fridays and Thursdays. This activity can be connected with extra free time for shopping towards the end of the week and/or effective promotions run at the end of the week.
Note: we lack data on Sunday sales, which may affect current assumptions, especially if weekend shopping behavior differs significantly from weekdays.
General Overview
While substantial work contributing to the product range analysis (PRA) has already been completed during the EDA stage, at this stage of the project we aim to gain a deeper understanding of the performance of different products and categories.
Given the complexity of classifying products based on keywords, we have chosen to implement the ABC-XYZ analysis method, which categorizes products based on their value contribution and demand variability.
ABC-XYZ Analysis Overview
ABC analysis categorizes products based on their value contribution (we’ve chosen revenue parameter in our case), classifying them into A (high-value), B (moderate-value), and C (low-value) groups.
XYZ analysis complements this by evaluating sales predictability, with X products (being highly stable), Y (having moderate variability), and Z (being unpredictable).
Combining ABC and XYZ analyses provides both understanding of product range performance and inventory management aspects (for instance, it enhances stock management, as we consider both consumption and demand volatility). It is also efficient for focusing on the most valuable products that generate the major revenue, and considering removal for less successful ones. Having said that, we can conclude that combined ABC-XYZ analysis strongly relates to our project objective.
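As an illustration of how the two dimensions come together, here is a minimal sketch on synthetic data of forming the combined class label and counting products per cell of the matrix; the actual classes will be assigned to our products later in this section.
# a minimal sketch on synthetic data: combining ABC and XYZ labels into one class
import pandas as pd

demo = pd.DataFrame({
    'stock_code_description': ['P1', 'P2', 'P3'],
    'abc_class': ['A', 'B', 'C'],
    'xyz_class': ['X', 'Z', 'Y']})

demo['abc_xyz_class'] = demo['abc_class'] + demo['xyz_class']       # e.g. 'AX', 'BZ', 'CY'
abc_xyz_matrix = pd.crosstab(demo['abc_class'], demo['xyz_class'])  # product counts per matrix cell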
Note: Basically, the ABC method categorizes products based on their revenue contribution, following the Pareto principle. It assigns products to Class A (top 80% of revenue), Class B (next 10%), and Class C (remaining 10%). However, the weights of the classes, and even their number, should be treated as a guideline rather than a mandatory rule. For a more precise analysis we may tailor the method to our specific business needs and particular product range.
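Since these cut-offs are a guideline, they can be turned into explicit parameters and tailored to the business; the sketch below is illustrative only and is not the classification function used later in this section.
# a minimal sketch of a threshold-parameterized ABC classifier (illustrative; not the function used below)
def abc_classification_custom(revenue_cum_pct, a_threshold=80, b_threshold=90):
    """Assign 'A', 'B', or 'C' based on cumulative revenue share and adjustable cut-offs."""
    if revenue_cum_pct <= a_threshold:
        return 'A'
    elif revenue_cum_pct <= b_threshold:
        return 'B'
    return 'C'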
The data we base our study on
Note 1: **By returns we consider only the negative part of mutually exclusive entries.** If we considered all negative quantity entries - for example, discounts and manual corrections - this could distort our analysis, as such operations are of a different nature*.
Note 2: **We will define new products as those having sales within the last three months but none before.**
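A minimal sketch of how such a flag could be derived (the DataFrame and column names follow the ones used in this notebook, while the three-month cut-off list is an illustrative assumption):
# a minimal sketch: flag products whose first sale falls within the last three months of the period
# (the cut-off list is an illustrative assumption)
last_three_months = ['2019-09', '2019-10', '2019-11']

first_sale_month = (df_ecom_filtered_12m
                    .groupby('stock_code_description')['invoice_year_month']
                    .min())

new_products = first_sale_month[first_sale_month.isin(last_three_months)].index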
Note 3: The RFM (Recency, Frequency, Monetary) analysis was also considered for PRA as an alternative to the ABC-XYZ method. However, since RFM analysis is primarily designed to segment customers based on purchasing behavior and loyalty, it appears less suited to product performance evaluation. In contrast, the ABC-XYZ analysis method directly targets product performance, making it more appropriate for the focus of this project.
*Note 1: If requested, we can make our ABC-XYZ analysis more complex by adding additional criteria (enhancing ABC analysis), e.g., quantity of products sold and number of invoices with a certain product. For instance, in such a matrix, products classified as AAAZ would be those generating high revenues, selling in large quantities, and frequently appearing in invoices but with unstable sales patterns. This modification can allow more precise tuning of marketing and inventory policies and action plans.
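A minimal sketch of this enhancement, assuming the df_ecom_summary_12m summary and the abc_classification function built later in this section; the helper and the resulting column names here are illustrative:
# a minimal sketch of a multi-criteria ABC classification (illustrative helper and column names)
def cumulative_pct(series):
    """Cumulative percentage contribution of a metric, with products ranked by that metric."""
    ordered = series.sort_values(ascending=False)
    return (ordered.cumsum() / ordered.sum() * 100).reindex(series.index)

summary = df_ecom_summary_12m.copy()
for metric in ['revenue', 'quantity']:   # further criteria (e.g. invoice counts) could be added here
    summary[f'abc_class_{metric}'] = cumulative_pct(summary[metric]).apply(abc_classification)

# concatenating the letters gives a combined label such as 'AA' or 'AC'
summary['abc_multi_class'] = summary['abc_class_revenue'] + summary['abc_class_quantity']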
Preview
Let’s recollect the findings we have gained so far: the share of all entries with negative quantity is almost twice as high as the share of returns from mutually exclusive entries (cases where the same customer bought and returned the same product): 8.4% against 4.4% by quantity and 9.2% against 4.7%, respectively. This difference can be explained by discounts, manual corrections, and extra fees and charges from marketplaces and banks. In this part of the study we will focus on returns only, as the other entries representing negative quantities have already been studied.
The general goal
In this study we aim to explore the characteristics of returns:
Furthermore, we will establish a classification system for returns. This will allow us to integrate return characteristics into our ABC-XYZ analysis, providing a more comprehensive view of product performance.
Before studying top returned products and seasonal patterns, we will again provide overall returns figures to demonstrate their scale.
Parameters to study
*Note: The “Return rate” parameter may seem far less valuable than the “Returns Loss Rate” parameter, which represents the direct financial and inventory impact. Nevertheless, it is substantial for the PRA. Even if the monetary value of returns is low, a high frequency of returns can significantly impact operational costs.
Also, a high share of entries with returns could indicate issues with product descriptions, quality, or customer expectations. We can sacrifice low-value products (according to the ABC-XYZ matrix) that also show a high share of entries with returns, whereas high-value products, even those with high return rates, should be analyzed more carefully rather than simply removed from the assortment. They have already proved to be attractive to customers and profitable for the business, and careful examination of customer feedback can reveal clues about the underlying issues (e.g. misleading descriptions or malfunctioning features) that could probably be fixed by suppliers.
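To make these parameters concrete, here is a minimal sketch under assumed definitions (the exact formulas applied later may differ): the return rate as the share of a product’s entries that are returns, and the returns loss rate as returned revenue relative to the product’s gross revenue. It assumes a df_returns frame holding the negative side of mutually exclusive entries; that name is a placeholder, not a frame defined so far.
# a minimal sketch under assumed definitions; df_returns (the negative side of mutually
# exclusive entries) is a placeholder name, and the exact formulas used later may differ
returns_per_product = df_returns.groupby('stock_code_description').agg(
    return_entries=('invoice_no', 'count'),
    returned_revenue=('revenue', 'sum'))       # negative values

sales_per_product = df_ecom_filtered_12m.query('quantity > 0').groupby('stock_code_description').agg(
    entries=('invoice_no', 'count'),
    gross_revenue=('revenue', 'sum'))

product_returns = sales_per_product.join(returns_per_product, how='left').fillna(0)
product_returns['return_rate'] = product_returns['return_entries'] / product_returns['entries']
product_returns['returns_loss_rate'] = -product_returns['returned_revenue'] / product_returns['gross_revenue']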
Methods of study
share_evaluation
function will be handy here as well.
Preview
As we revealed at the EDA stage, expanding the product assortment is one of the key drivers of business growth (for both revenue and sales volume), which makes this study valuable. It is essential to acknowledge that new products may be underestimated and misclassified due to their short sales track record. This analysis aims to provide a clearer understanding of their performance within the overall dataset. We will flag new products in our ABC-XYZ analysis, recognizing that they may represent a substantial part of our total offerings. Additionally, we will study these products separately to gain deeper insights into their characteristics and contributions.
The general goal
In this study, we aim to explore the characteristics of new products:
Furthermore, we will establish a classification system for new products. This will allow us to integrate their characteristics into our ABC-XYZ analysis, providing a more comprehensive view of product performance.
Before studying top performing new products and sales patterns, we will present overall figures for new products to demonstrate their scale and impact.
Parameters to study
*Note: The “Sales Volume” parameter may seem less valuable than “Revenue Contribution,” which directly reflects financial impact. However, it is crucial for evaluating business growth. Even if the financial value is low, a high volume of sales can indicate strong customer interest and efficient marketing activities.
Additionally, a high share of entries involving new products could highlight issues with product visibility or marketing strategies. We may consider discontinuing low-performing new products while closely analyzing those with high revenue contributions but lower sales volumes. These products may still hold potential if supported by effective marketing or adjustments based on customer feedback.
Methods of Study
share_evaluation
function will be useful here as well.
Let’s examine the ABC-XYZ matrix in terms of consumption levels and demand stability.
Here we will describe the main characteristics of each class and provide an approach to addressing them in terms of both inventory management and business development.
Note: The description of the inventory approach toward the ABC-XYZ matrix is based on information provided by the Association of International Certified Professional Accountants.
Note: Within the frame of this study, we’ve chosen revenue generation as the criterion for product evaluation in the ABC analysis.*
Inventory Management
With different colors in the matrix above, we present inventory management policies that may include:
Business Development
Let’s define business development policies for each class, dividing them into two key areas: - 🟡 Marketing and sales - 🟣 Product development
A (Premium) | B (Standard) | C (Basic) | |
---|---|---|---|
X | AX Class | BX Class | CX Class |
🟡 | - Adjust pricing often - Use best-possible media content, detailed product info and customers’ feedback - Actively invest in marketing campaigns |
- Tune prices regularly - Ensure good enough media content and clear descriptions - Run occasional marketing campaigns |
- Minimal pricing adjustments - Basic descriptions - Low marketing efforts, consider as complementary purchases |
🟣 | - Focus on unique features and continuous improvement | - Update based on customer demands | - Keep it simple, only essentials |
Y | AY Class | BY Class | CY Class |
🟡 | - Adjust pricing based on seasonal demand - Launch exclusive seasonal promotions |
- Run limited-time promotions for niche markets - Market based on trends and demand shifts |
- Focus on wholesales and large seasonal sales |
🟣 | - Offer seasonal variations | - Tune to match seasonal trends | - Check whether they are sold on their own or within bigger purchases - Consider using them as complementary goods or withdrawing them |
Z | AZ Class | BZ Class | CZ Class |
🟡 | - Adjust prices on occasions - Focus on sales for high-value customers |
- Keep pricing flexible and consultative - Target niche customers |
- Depends on overall performance trends* |
🟣 | - Provide custom solutions based on customer needs | - Provide only low-effort custom solutions | - Depends on overall performance trends* |
Let’s calculate a summary for each stock code.
df_ecom_summary_12m = df_ecom_filtered_12m.groupby(['stock_code_description']).agg(
    quantity=('quantity', 'sum'),
    revenue=('revenue', 'sum'),
).sort_values(by='revenue', ascending=False).reset_index()

df_ecom_summary_12m
stock_code_description | quantity | revenue | |
---|---|---|---|
0 | 22423__REGENCY CAKESTAND 3 TIER | 13157 | 165414.75 |
1 | 85123A__WHITE HANGING HEART T-LIGHT ... | 36221 | 100641.99 |
2 | 47566__PARTY BUNTING | 18195 | 98828.59 |
3 | 85099B__JUMBO BAG RED RETROSPOT | 47304 | 92101.20 |
4 | 23084__RABBIT NIGHT LIGHT | 27349 | 59266.78 |
... | ... | ... | ... |
3905 | 84201C__HAPPY BIRTHDAY CARD TEDDY/CAKE | 5 | 0.95 |
3906 | 90084__PINK CRYSTAL GUITAR PHONE CHARM | 1 | 0.85 |
3907 | 51014c__FEATHER PEN,COAL BLACK | 1 | 0.83 |
3908 | 84227__HEN HOUSE W CHICK IN NEST | 1 | 0.42 |
3909 | PADS__PADS TO MATCH ALL CUSHIONS | 3 | 0.00 |
3910 rows × 3 columns
Next let’s calculate ABC classes. To proceed we need the revenue for all stock codes and the cumulative percentage of revenue each stock code contributes. The stock codes must be sorted by revenue in descending order as we did above. We can then use the cumsum()
function to calculate the cumulative revenue and its running percentage, storing these in the DataFrame.
df_ecom_summary_12m['revenue_cum_sum'] = df_ecom_summary_12m['revenue'].cumsum()
df_ecom_summary_12m['revenue_total'] = df_ecom_summary_12m['revenue'].sum()
df_ecom_summary_12m['revenue_cum_pct'] = (df_ecom_summary_12m['revenue_cum_sum'] / df_ecom_summary_12m['revenue_total']) * 100

df_ecom_summary_12m.head()
stock_code_description | quantity | revenue | revenue_cum_sum | revenue_total | revenue_cum_pct | |
---|---|---|---|---|---|---|
0 | 22423__REGENCY CAKESTAND 3 TIER | 13157 | 165414.75 | 165414.75 | 9517759.45 | 1.74 |
1 | 85123A__WHITE HANGING HEART T-LIGHT ... | 36221 | 100641.99 | 266056.74 | 9517759.45 | 2.80 |
2 | 47566__PARTY BUNTING | 18195 | 98828.59 | 364885.33 | 9517759.45 | 3.83 |
3 | 85099B__JUMBO BAG RED RETROSPOT | 47304 | 92101.20 | 456986.53 | 9517759.45 | 4.80 |
4 | 23084__RABBIT NIGHT LIGHT | 27349 | 59266.78 | 516253.31 | 9517759.45 | 5.42 |
We will create a function to assign products to classes based on their revenue contribution. For instance, stock codes generating the top 80% of revenue are class A, the next 10% are Class B, and the remainder are Class C.
def abc_classification(revenue_cum_pct):
"""
The function assigns a product to an ABC class based on its percentage revenue contribution.
Input:
revenue_cum_pct (float): the cumulative percentage of revenue contributed by the product.
Output:
str: 'A', 'B', or 'C' indicating the ABC class based on the provided thresholds:
- 'A' for the top 80% revenue contributors
- 'B' for the next 10% revenue contributors
- 'C' for the remaining revenue contributors
----------------
Note: This classification method follows the Pareto principle, where the majority of revenue is typically generated by a small proportion of products (Class A), which is not always the case.
----------------
"""
if revenue_cum_pct > 0 and revenue_cum_pct <= 80:
return 'A'
elif revenue_cum_pct > 80 and revenue_cum_pct <= 90:
return 'B'
else:
return 'C'
Let’s apply the abc_classification()
function above and assign the abc_class
value to the DataFrame.
df_ecom_summary_12m['abc_class'] = df_ecom_summary_12m['revenue_cum_pct'].apply(abc_classification)
df_ecom_summary_12m.head(3)
stock_code_description | quantity | revenue | revenue_cum_sum | revenue_total | revenue_cum_pct | abc_class | |
---|---|---|---|---|---|---|---|
0 | 22423__REGENCY CAKESTAND 3 TIER | 13157 | 165414.75 | 165414.75 | 9517759.45 | 1.74 | A |
1 | 85123A__WHITE HANGING HEART T-LIGHT ... | 36221 | 100641.99 | 266056.74 | 9517759.45 | 2.80 | A |
2 | 47566__PARTY BUNTING | 18195 | 98828.59 | 364885.33 | 9517759.45 | 3.83 | A |
# creating a `df_abc` DataFrame, summarizing the main parameters
df_abc = df_ecom_summary_12m.groupby('abc_class').agg(
    unique_products=('stock_code_description', 'nunique'),
    quantity=('quantity', 'sum'),
    revenue=('revenue', 'sum'),
).reset_index()

# calculating shares of totals of each group for revenue and product range
df_abc['revenue_pct'] = round(df_abc['revenue'] / df_abc['revenue'].sum(), 2)
df_abc['products_pct'] = round(df_abc['unique_products'] / df_abc['unique_products'].sum(), 2)

df_abc
abc_class | unique_products | quantity | revenue | revenue_pct | products_pct | |
---|---|---|---|---|---|---|
0 | A | 842 | 3500580 | 7611955.54 | 0.80 | 0.22 |
1 | B | 510 | 744039 | 953294.95 | 0.10 | 0.13 |
2 | C | 2558 | 927395 | 952508.96 | 0.10 | 0.65 |
# calculating number of stock codes by ABC Class
ax = plt.subplots(figsize=(5, 3))
ax = sns.barplot(x='abc_class',
                 y='unique_products',
                 data=df_abc,
                 palette='RdYlGn_r')\
    .set_title('Number of Products by ABC Class', fontsize=14)
# calculating quantity of units by ABC Class
ax = plt.subplots(figsize=(5, 3))
ax = sns.barplot(x='abc_class',
                 y='quantity',
                 data=df_abc,
                 palette='RdYlGn_r')

ax.set_title('Quantity of Units by ABC Class', fontsize=14)

# setting y-axis to display numbers in non-scientific format
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'));
# calculating revenue by ABC Class
ax = plt.subplots(figsize=(5, 3))
ax = sns.barplot(x='abc_class',
                 y='revenue',
                 data=df_abc,
                 palette='RdYlGn_r')
ax.set_title('Revenue by ABC Class', fontsize=14)

# setting y-axis to display numbers in non-scientific format
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'));
In addition, let’s make a bubble chart that shows both total quantity and total revenue by ABC class together. We will use the Plotly visualization library to make it more interactive.
# plotting a bubble chart of ABC analysis
fig = px.scatter(
    df_abc,
    x='revenue',
    y='quantity',
    size='revenue',
    color='revenue',
    color_continuous_scale='RdYlGn',
    hover_name='abc_class',
    text='abc_class',
    title='ABC Analysis Bubble Chart of Quantity vs. Revenue')

fig.update_layout(
    height=600,
    width=600,
    title_x=0.5,
    title_y=0.9)
fig.update_traces(textposition='middle left')

fig.show();
We will calculate the coefficient of variation (CoV) of quantity for each product and assign the appropriate classes. Let’s define what these classes represent: X products have low demand variability (CoV up to 0.5), Y products have moderate variability (CoV between 0.5 and 1.0), and Z products have high variability (CoV above 1.0).
We will implement a function that assigns the appropriate class to each product based on its cov_quantity
value, following the established XYZ classification rules.
First, we need to reshape the data so that the monthly quantities for each stock code are available inside the DataFrame.
df_products_monthly_quantity_12m = df_ecom_filtered_12m.groupby(['stock_code_description', 'invoice_year_month'])['quantity'].sum().reset_index()
df_products_monthly_quantity_12m.head()
stock_code_description | invoice_year_month | quantity | |
---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 2018-12 | 190 |
1 | 10002__INFLATABLE POLITICAL GLOBE | 2019-01 | 340 |
2 | 10002__INFLATABLE POLITICAL GLOBE | 2019-02 | 54 |
3 | 10002__INFLATABLE POLITICAL GLOBE | 2019-03 | 146 |
4 | 10002__INFLATABLE POLITICAL GLOBE | 2019-04 | 69 |
Let’s place each product on its own line and store the number of units sold in each month in a separate column.
df_products_monthly_quantity_12m_t = (
    df_products_monthly_quantity_12m.pivot(index='stock_code_description', columns='invoice_year_month', values='quantity')
    # .add_prefix('m_')
    .reset_index()
    .fillna(0))

df_products_monthly_quantity_12m_t.head(3)
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 190.00 | 340.00 | 54.00 | 146.00 | 69.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 10080__GROOVY CACTUS INFLATABLE | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 60.00 | 60.00 | 24.00 | 60.00 | 30.00 | 67.00 |
2 | 10120__DOGGY RUBBER | 16.00 | 0.00 | 30.00 | 28.00 | 0.00 | 3.00 | 0.00 | 10.00 | 30.00 | 10.00 | 11.00 | 48.00 |
Let’s calculate the standard deviation in demand (for naming consistency we will call it std_quantity
). Using a subset of the month columns, we can append .std(axis=1)
to calculate the standard deviation of each row’s values and assign it back to the DataFrame.
# extracting columns with months
year_month_columns_12m = [column for column in df_products_monthly_quantity_12m_t.columns
                          if re.match(r'\d{4}-\d{2}', column)]
year_month_columns_12m
['2018-12',
'2019-01',
'2019-02',
'2019-03',
'2019-04',
'2019-05',
'2019-06',
'2019-07',
'2019-08',
'2019-09',
'2019-10',
'2019-11']
df_products_monthly_quantity_12m_t['std_quantity'] = df_products_monthly_quantity_12m_t[year_month_columns_12m].std(axis=1)
df_products_monthly_quantity_12m_t.head(3)
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 190.00 | 340.00 | 54.00 | 146.00 | 69.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 107.66 |
1 | 10080__GROOVY CACTUS INFLATABLE | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 60.00 | 60.00 | 24.00 | 60.00 | 30.00 | 67.00 | 28.79 |
2 | 10120__DOGGY RUBBER | 16.00 | 0.00 | 30.00 | 28.00 | 0.00 | 3.00 | 0.00 | 10.00 | 30.00 | 10.00 | 11.00 | 48.00 | 15.35 |
Our next step is to calculate the sum of all the monthly data in order to determine the total quantity.
df_products_monthly_quantity_12m_t['quantity'] = df_products_monthly_quantity_12m_t[year_month_columns_12m].sum(axis=1)
df_products_monthly_quantity_12m_t.head(3)
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity | quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 190.00 | 340.00 | 54.00 | 146.00 | 69.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 107.66 | 799.00 |
1 | 10080__GROOVY CACTUS INFLATABLE | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 60.00 | 60.00 | 24.00 | 60.00 | 30.00 | 67.00 | 28.79 | 303.00 |
2 | 10120__DOGGY RUBBER | 16.00 | 0.00 | 30.00 | 28.00 | 0.00 | 3.00 | 0.00 | 10.00 | 30.00 | 10.00 | 11.00 | 48.00 | 15.35 | 186.00 |
By dividing the quantity
column by the 12 months in the dataset, we will calculate the average monthly quantity per stock code over the year.
df_products_monthly_quantity_12m_t['avg_quantity'] = df_products_monthly_quantity_12m_t['quantity'] / 12
df_products_monthly_quantity_12m_t.head(3)
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity | quantity | avg_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 190.00 | 340.00 | 54.00 | 146.00 | 69.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 107.66 | 799.00 | 66.58 |
1 | 10080__GROOVY CACTUS INFLATABLE | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 60.00 | 60.00 | 24.00 | 60.00 | 30.00 | 67.00 | 28.79 | 303.00 | 25.25 |
2 | 10120__DOGGY RUBBER | 16.00 | 0.00 | 30.00 | 28.00 | 0.00 | 3.00 | 0.00 | 10.00 | 30.00 | 10.00 | 11.00 | 48.00 | 15.35 | 186.00 | 15.50 |
Finally, we can calculate the amount of variation seen in quantity for each stock code across the year. This is the standard deviation of quantity divided by the mean quantity, which gives the coefficient of variation (CoV). A value closer to zero implies that the variation is minimal and predictability is high, while high CoV values indicate the opposite.
df_products_monthly_quantity_12m_t['cov_quantity'] = df_products_monthly_quantity_12m_t['std_quantity'] / df_products_monthly_quantity_12m_t['avg_quantity']

df_products_monthly_quantity_12m_t.head(3)
df_products_monthly_quantity_12m_t['cov_quantity'].describe()
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity | quantity | avg_quantity | cov_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 190.00 | 340.00 | 54.00 | 146.00 | 69.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 107.66 | 799.00 | 66.58 | 1.62 |
1 | 10080__GROOVY CACTUS INFLATABLE | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 60.00 | 60.00 | 24.00 | 60.00 | 30.00 | 67.00 | 28.79 | 303.00 | 25.25 | 1.14 |
2 | 10120__DOGGY RUBBER | 16.00 | 0.00 | 30.00 | 28.00 | 0.00 | 3.00 | 0.00 | 10.00 | 30.00 | 10.00 | 11.00 | 48.00 | 15.35 | 186.00 | 15.50 | 0.99 |
count 3910.00
mean 1.47
std 0.85
min 0.15
25% 0.81
50% 1.29
75% 1.91
max 3.46
Name: cov_quantity, dtype: float64
Let’s check the distribution of CoV (cov_quantity
) and its descriptive statistics. Once again, our distribution_IQR
function appears handy for that.
# checking distribution of quantity coefficient of variation (`cov_quantity`) + its descriptive statistics
distribution_IQR(df_products_monthly_quantity_12m_t, 'cov_quantity', x_limits=[0, 5], title_extension='', bins=[25, 100], outliers_info=False)
==================================================
Statistics on cov_quantity
in df_products_monthly_quantity_12m_t
count 3910.00
mean 1.47
std 0.85
min 0.15
25% 0.81
50% 1.29
75% 1.91
max 3.46
Name: cov_quantity, dtype: float64
--------------------------------------------------
The distribution is moderately skewed to the right
(skewness: 0.9)
Note: outliers affect skewness calculation
==================================================
Observations
df_products_monthly_quantity_12m_t.query('cov_quantity > 3.3')['cov_quantity'].value_counts()
cov_quantity
3.46 137
3.46 76
3.46 21
3.46 11
3.46 6
3.46 4
3.46 3
3.36 1
3.34 1
3.39 1
3.34 1
3.46 1
3.33 1
3.32 1
3.41 1
3.43 1
3.40 1
3.32 1
Name: count, dtype: int64
df_products_monthly_quantity_12m_t.query('cov_quantity >= 3.3')
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity | quantity | avg_quantity | cov_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38 | 16043__POP ART PUSH DOWN RUBBER | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 98.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 28.29 | 98.00 | 8.17 | 3.46 |
45 | 16151A__FLOWERS HANDBAG blue and orange | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 49.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 14.15 | 49.00 | 4.08 | 3.46 |
57 | 16169N__WRAP BLUE RUSSIAN FOLKART | 25.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.22 | 25.00 | 2.08 | 3.46 |
58 | 16169P__WRAP GREEN RUSSIAN FOLKART | 50.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 14.43 | 50.00 | 4.17 | 3.46 |
60 | 16202B__PASTEL BLUE PHOTO ALBUM | 0.00 | 0.00 | 0.00 | 0.00 | 29.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8.37 | 29.00 | 2.42 | 3.46 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3823 | 90187A__BLUE DROP EARRINGS W BEAD CL... | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3898 | DCGS0004__HAYNES CAMPER SHOULDER BAG | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3899 | DCGS0069__OOH LA LA DOGS COLLAR | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3900 | DCGS0070__CAMOUFLAGE DOG COLLAR | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3901 | DCGS0076__SUNJAR LED NIGHT NIGHT LIGHT | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.87 | 3.00 | 0.25 | 3.46 |
269 rows × 17 columns
df_products_monthly_quantity_12m_t.query('quantity == 1')
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity | quantity | avg_quantity | cov_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
161 | 20703__BLUE PADDED SOFT MOBILE | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
250 | 20860__GOLD COSMETICS BAG WITH BUTTE... | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
310 | 21009__ETCHED GLASS STAR TREE DECORA... | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
375 | 21120__*Boombox Ipod Classic | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
399 | 21160__KEEP OUT GIRLS DOOR HANGER | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3816 | 90184c__BLACK CHUNKY BEAD BRACELET W... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3823 | 90187A__BLUE DROP EARRINGS W BEAD CL... | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3898 | DCGS0004__HAYNES CAMPER SHOULDER BAG | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3899 | DCGS0069__OOH LA LA DOGS COLLAR | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
3900 | DCGS0070__CAMOUFLAGE DOG COLLAR | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 1.00 | 0.08 | 3.46 |
86 rows × 17 columns
products_high_cov = df_products_monthly_quantity_12m_t.query('cov_quantity >= 3.3')['stock_code_description'].unique()
df_ecom_filtered.query('stock_code_description in @products_high_cov and quantity == 1').sample(2)
df_ecom_filtered.query('stock_code_description in @products_high_cov and quantity == 1')['quantity'].value_counts()
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2457 | 536591 | 21488 | RED WHITE SCARF HOT WATER BOTTLE | 1 | 2018-11-29 16:58:00 | 3.95 | 14606 | 2018 | 11 | 2018-11 | 48 | 2018-Week-48 | 2018-11-29 | 3 | Thursday | 3.95 | RED WHITE SCARF HOT WATER BOTTLE | 21488__RED WHITE SCARF HOT WATER BOTTLE |
15719 | 537640 | 22528 | GARDENERS KNEELING PAD | 1 | 2018-12-05 15:31:00 | 3.36 | 0 | 2018 | 12 | 2018-12 | 49 | 2018-Week-49 | 2018-12-05 | 2 | Wednesday | 3.36 | GARDENERS KNEELING PAD | 22528__GARDENERS KNEELING PAD |
quantity
1 462
Name: count, dtype: int64
Observations
Let’s proceed with the classification of products using the xyz_classification function defined below.
def xyz_classification(cov):
"""
The function assigns a product to an XYZ class based on its coefficient of variation (CoV)
in order quantity, indicating quantity variability.
Input:
cov (float): The coefficient of variation in order quantity for the product.
Output:
str: 'X', 'Y', or 'Z' indicating the XYZ class based on the following thresholds:
- 'X' for products with low variability (CoV <= 0.5)
- 'Y' for products with moderate variability (0.5 < CoV <= 1.0)
- 'Z' for products with high variability (CoV > 1.0)
"""
if cov > 0 and cov <= 0.5:
return 'X'
elif cov > 0.5 and cov <= 1.0:
return 'Y'
else:
return 'Z'
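As a quick illustration of how the classifier behaves, here is a toy example with made-up monthly quantities; the CoV is computed as the sample standard deviation divided by the mean, which is consistent with the std_quantity, avg_quantity, and cov_quantity columns shown above.

import pandas as pd

# made-up monthly quantities for a hypothetical, fairly stable product (illustrative numbers only)
toy_monthly_quantities = pd.Series([120, 90, 150, 110, 95, 130, 105, 125, 115, 100, 140, 120])

toy_cov = toy_monthly_quantities.std() / toy_monthly_quantities.mean()  # sample std / mean
print(round(toy_cov, 2), xyz_classification(toy_cov))  # ~0.15 -> 'X' (stable demand)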
df_products_monthly_quantity_12m_t['xyz_class'] = df_products_monthly_quantity_12m_t['cov_quantity'].apply(xyz_classification)

# generating a summary of the distribution of stock codes across the classes
df_products_monthly_quantity_12m_t['xyz_class'].value_counts()
xyz_class
Z 2530
Y 1062
X 318
Name: count, dtype: int64
Observations
# creating a DataFrame summarizing data on XYZ classes
xyz_summary = df_products_monthly_quantity_12m_t.groupby('xyz_class').agg(
    unique_products=('stock_code_description', 'nunique'),
    quantity=('quantity', 'sum'),
    std_quantity=('std_quantity', 'mean'),
    avg_quantity=('avg_quantity', 'mean'),
    avg_cov_quantity=('cov_quantity', 'mean'))

# calculating shares of product range of each class
xyz_summary['products_pct'] = round(xyz_summary['unique_products'] / xyz_summary['unique_products'].sum(), 2)

xyz_summary
unique_products | quantity | std_quantity | avg_quantity | avg_cov_quantity | products_pct | |
---|---|---|---|---|---|---|
xyz_class | ||||||
X | 318 | 1433994.00 | 144.49 | 375.78 | 0.41 | 0.08 |
Y | 1062 | 2029013.00 | 111.68 | 159.21 | 0.75 | 0.27 |
Z | 2530 | 1709007.00 | 88.12 | 56.29 | 1.91 | 0.65 |
# creating a DataFrame summarizing data on XYZ classes by months
df_products_monthly_quantity_12m_t_summary = df_products_monthly_quantity_12m_t.groupby('xyz_class').agg(
    {column: 'sum' for column in year_month_columns_12m})

df_products_monthly_quantity_12m_t_summary
invoice_year_month | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
xyz_class | ||||||||||||
X | 85054.00 | 115208.00 | 100198.00 | 141957.00 | 99019.00 | 135671.00 | 113754.00 | 116235.00 | 130183.00 | 125887.00 | 127833.00 | 142995.00 |
Y | 109686.00 | 137224.00 | 122780.00 | 163584.00 | 137210.00 | 203910.00 | 161900.00 | 174793.00 | 179824.00 | 183074.00 | 209370.00 | 245658.00 |
Z | 104721.00 | 85589.00 | 54884.00 | 68356.00 | 56790.00 | 76801.00 | 94453.00 | 127998.00 | 129452.00 | 221951.00 | 319079.00 | 368933.00 |
# by use of "melt" method resetting index to convert columns into a DataFrame for further plotting
= df_products_monthly_quantity_12m_t_summary.reset_index().melt(id_vars='xyz_class', var_name='year_month', value_name='quantity')
df_products_monthly_quantity_12m_t_summary_m 6) df_products_monthly_quantity_12m_t_summary_m.head(
xyz_class | year_month | quantity | |
---|---|---|---|
0 | X | 2018-12 | 85054.00 |
1 | Y | 2018-12 | 109686.00 |
2 | Z | 2018-12 | 104721.00 |
3 | X | 2019-01 | 115208.00 |
4 | Y | 2019-01 | 137224.00 |
5 | Z | 2019-01 | 85589.00 |
# plotting a lineplot of monthly quantity per XYZ Class
plt.figure(figsize=(8, 4))
sns.set_palette('RdYlGn_r')

ax = sns.lineplot(data=df_products_monthly_quantity_12m_t_summary_m,
                  x='year_month',
                  y='quantity',
                  hue='xyz_class',
                  marker='o',
                  linewidth=2.5,
                  markersize=7)

ax.set_title('Monthly Quantity per XYZ Class', fontsize=16)
ax.set_xlabel('Months', fontsize=12)
ax.set_ylabel('Quantity', fontsize=12)

ax.legend(title='XYZ Class', fontsize=10)
plt.xticks(rotation=45)
plt.show();
Next, we will bring our ABC and XYZ analyses together by merging the corresponding DataFrames.
df_abc_summary = df_ecom_summary_12m[['stock_code_description', 'abc_class', 'revenue']].copy()
df_xyz_summary = df_products_monthly_quantity_12m_t[['stock_code_description', 'std_quantity', 'quantity', 'avg_quantity', 'cov_quantity', 'xyz_class']].copy()

df_abc_xyz = df_abc_summary.merge(df_xyz_summary, on='stock_code_description', how='left')
df_abc_xyz.head()
stock_code_description | abc_class | revenue | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | |
---|---|---|---|---|---|---|---|---|
0 | 22423__REGENCY CAKESTAND 3 TIER | A | 165414.75 | 276.81 | 13157.00 | 1096.42 | 0.25 | X |
1 | 85123A__WHITE HANGING HEART T-LIGHT ... | A | 100641.99 | 1455.14 | 36221.00 | 3018.42 | 0.48 | X |
2 | 47566__PARTY BUNTING | A | 98828.59 | 1010.70 | 18195.00 | 1516.25 | 0.67 | Y |
3 | 85099B__JUMBO BAG RED RETROSPOT | A | 92101.20 | 1406.56 | 47304.00 | 3942.00 | 0.36 | X |
4 | 23084__RABBIT NIGHT LIGHT | A | 59266.78 | 4470.61 | 27349.00 | 2279.08 | 1.96 | Z |
Let’s create an ABC-XYZ Class indication by combining abc_class with xyz_class values.

df_abc_xyz['abc_xyz_class'] = df_abc_xyz['abc_class'] + df_abc_xyz['xyz_class']
df_abc_xyz.head(3)
stock_code_description | abc_class | revenue | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | abc_xyz_class | |
---|---|---|---|---|---|---|---|---|---|
0 | 22423__REGENCY CAKESTAND 3 TIER | A | 165414.75 | 276.81 | 13157.00 | 1096.42 | 0.25 | X | AX |
1 | 85123A__WHITE HANGING HEART T-LIGHT ... | A | 100641.99 | 1455.14 | 36221.00 | 3018.42 | 0.48 | X | AX |
2 | 47566__PARTY BUNTING | A | 98828.59 | 1010.70 | 18195.00 | 1516.25 | 0.67 | Y | AY |
# calculating ABC-XYZ summary
df_abc_xyz_summary = df_abc_xyz.groupby('abc_xyz_class').agg(
    unique_products=('stock_code_description', 'nunique'),
    quantity=('quantity', 'sum'),
    avg_quantity=('avg_quantity', 'mean'),
    revenue=('revenue', 'sum'),
    cov_quantity=('cov_quantity', 'mean')
).reset_index()

# calculating shares of totals of each group for revenue and product range
df_abc_xyz_summary['revenue_pct'] = round(df_abc_xyz_summary['revenue'] / df_abc_xyz_summary['revenue'].sum(), 2)
df_abc_xyz_summary['quantity_pct'] = round(df_abc_xyz_summary['quantity'] / df_abc_xyz_summary['quantity'].sum(), 2)
df_abc_xyz_summary['products_pct'] = round(df_abc_xyz_summary['unique_products'] / df_abc_xyz_summary['unique_products'].sum(), 2)

df_abc_xyz_summary.sort_values(by='revenue', ascending=False)
abc_xyz_class | unique_products | quantity | avg_quantity | revenue | cov_quantity | revenue_pct | quantity_pct | products_pct | |
---|---|---|---|---|---|---|---|---|---|
1 | AY | 342 | 1430568.00 | 348.58 | 3212072.15 | 0.71 | 0.34 | 0.28 | 0.09 |
0 | AX | 199 | 1255673.00 | 525.83 | 2277287.47 | 0.39 | 0.24 | 0.24 | 0.05 |
2 | AZ | 301 | 814339.00 | 225.45 | 2122595.92 | 1.57 | 0.22 | 0.16 | 0.08 |
8 | CZ | 1972 | 560928.00 | 23.70 | 600955.92 | 2.00 | 0.06 | 0.11 | 0.50 |
5 | BZ | 257 | 333740.00 | 108.22 | 475955.53 | 1.60 | 0.05 | 0.06 | 0.07 |
4 | BY | 191 | 290058.00 | 126.55 | 359947.21 | 0.73 | 0.04 | 0.06 | 0.05 |
7 | CY | 529 | 308387.00 | 48.58 | 305357.82 | 0.78 | 0.03 | 0.06 | 0.14 |
3 | BX | 62 | 120241.00 | 161.61 | 117392.21 | 0.42 | 0.01 | 0.02 | 0.02 |
6 | CX | 57 | 58080.00 | 84.91 | 46195.22 | 0.43 | 0.00 | 0.01 | 0.01 |
Most revenue comes from the AY class (34% of the total).
# plotting a barplot of products count by ABC-XYZ Class
fig, ax = plt.subplots(figsize=(5, 3))
ax = sns.barplot(x='abc_xyz_class',
                 y='unique_products',
                 data=df_abc_xyz_summary,
                 palette='RdYlGn_r')
ax.set_title('Number of Products by ABC-XYZ Class', fontsize=14);
# plotting a barplot of total revenue by ABC-XYZ Class
fig, ax = plt.subplots(figsize=(5, 3))
ax = sns.barplot(x='abc_xyz_class',
                 y='revenue',
                 data=df_abc_xyz_summary,
                 palette='RdYlGn_r')
ax.set_title('Revenue by ABC-XYZ Class', fontsize=14)

# setting y-axis to display numbers in non-scientific format
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'));
# plotting a barplot of total quantity by ABC-XYZ Class
fig, ax = plt.subplots(figsize=(5, 3))
ax = sns.barplot(x='abc_xyz_class',
                 y='quantity',
                 data=df_abc_xyz_summary,
                 palette='RdYlGn_r')
ax.set_title('Quantity by ABC-XYZ Class', fontsize=14)

# setting y-axis to display numbers in non-scientific format
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'));
# plotting the bubble chart of quantity and revenue for ABC-XYZ analysis
fig = px.scatter(
    df_abc_xyz_summary,
    x='revenue',
    y='quantity',
    size='revenue',
    color='revenue',
    color_continuous_scale='RdYlGn',
    hover_name='abc_xyz_class',
    text='abc_xyz_class',
    title='ABC-XYZ Analysis Bubble Chart of Quantity vs. Revenue')

fig.update_layout(
    height=650,
    width=650,
    title_x=0.5,
    title_y=0.9)
fig.update_traces(textposition='middle left')
fig.show();
Let’s also examine the monthly sales volume dynamics across all ABC-XYZ Classes.
# merging the DataFrames to obtain ABC-XYZ Class and monthly sales volume distribution for each product altogether
df_products_monthly_quantity_12m_t_classes = df_products_monthly_quantity_12m_t.merge(df_abc_xyz, on='stock_code_description', how='left')
df_products_monthly_quantity_12m_t_classes.head(3)
stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity_x | quantity_x | avg_quantity_x | cov_quantity_x | xyz_class_x | abc_class | revenue | std_quantity_y | quantity_y | avg_quantity_y | cov_quantity_y | xyz_class_y | abc_xyz_class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 190.00 | 340.00 | 54.00 | 146.00 | 69.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 107.66 | 799.00 | 66.58 | 1.62 | Z | C | 708.04 | 107.66 | 799.00 | 66.58 | 1.62 | Z | CZ |
1 | 10080__GROOVY CACTUS INFLATABLE | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 60.00 | 60.00 | 24.00 | 60.00 | 30.00 | 67.00 | 28.79 | 303.00 | 25.25 | 1.14 | Z | C | 119.09 | 28.79 | 303.00 | 25.25 | 1.14 | Z | CZ |
2 | 10120__DOGGY RUBBER | 16.00 | 0.00 | 30.00 | 28.00 | 0.00 | 3.00 | 0.00 | 10.00 | 30.00 | 10.00 | 11.00 | 48.00 | 15.35 | 186.00 | 15.50 | 0.99 | Y | C | 39.06 | 15.35 | 186.00 | 15.50 | 0.99 | Y | CY |
# creating a DataFrame summarizing data on ABC-XYZ classes by months
df_products_monthly_quantity_12m_t_classes_summary = df_products_monthly_quantity_12m_t_classes.groupby('abc_xyz_class').agg(
    {column: 'sum' for column in year_month_columns_12m}).reset_index()

df_products_monthly_quantity_12m_t_classes_summary
abc_xyz_class | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AX | 76791.00 | 99735.00 | 85816.00 | 125705.00 | 85447.00 | 119803.00 | 98408.00 | 101038.00 | 115995.00 | 111189.00 | 111077.00 | 124669.00 |
1 | AY | 74979.00 | 90936.00 | 83677.00 | 113700.00 | 92900.00 | 149295.00 | 110102.00 | 119830.00 | 134356.00 | 131195.00 | 149799.00 | 179799.00 |
2 | AZ | 43167.00 | 22116.00 | 17042.00 | 14144.00 | 17308.00 | 37584.00 | 45599.00 | 57676.00 | 65904.00 | 120825.00 | 166211.00 | 206763.00 |
3 | BX | 5436.00 | 10459.00 | 9643.00 | 11127.00 | 9444.00 | 11057.00 | 10393.00 | 10103.00 | 9232.00 | 9554.00 | 11520.00 | 12273.00 |
4 | BY | 16839.00 | 20378.00 | 19561.00 | 23590.00 | 22838.00 | 28796.00 | 24554.00 | 25116.00 | 21408.00 | 24836.00 | 30054.00 | 32088.00 |
5 | BZ | 18011.00 | 15544.00 | 9371.00 | 19753.00 | 9386.00 | 12411.00 | 17448.00 | 22826.00 | 26456.00 | 41206.00 | 63437.00 | 77891.00 |
6 | CX | 2827.00 | 5014.00 | 4739.00 | 5125.00 | 4128.00 | 4811.00 | 4953.00 | 5094.00 | 4956.00 | 5144.00 | 5236.00 | 6053.00 |
7 | CY | 17868.00 | 25910.00 | 19542.00 | 26294.00 | 21472.00 | 25819.00 | 27244.00 | 29847.00 | 24060.00 | 27043.00 | 29517.00 | 33771.00 |
8 | CZ | 43543.00 | 47929.00 | 28471.00 | 34459.00 | 30096.00 | 26806.00 | 31406.00 | 47496.00 | 37092.00 | 59920.00 | 89431.00 | 84279.00 |
# by use of "melt" method resetting index to convert columns into a DataFrame for further plotting
= df_products_monthly_quantity_12m_t_classes_summary.reset_index().melt(id_vars='abc_xyz_class', var_name='year_month', value_name='quantity')
df_products_monthly_quantity_12m_t_classes_summary_m 6) df_products_monthly_quantity_12m_t_classes_summary_m.head(
abc_xyz_class | year_month | quantity | |
---|---|---|---|
0 | AX | 2018-12 | 76791.00 |
1 | AY | 2018-12 | 74979.00 |
2 | AZ | 2018-12 | 43167.00 |
3 | BX | 2018-12 | 5436.00 |
4 | BY | 2018-12 | 16839.00 |
5 | BZ | 2018-12 | 18011.00 |
# plotting a lineplot of monthly quantity per ABC-XYZ Class
plt.figure(figsize=(12, 8))
sns.set_palette('RdYlGn_r')

ax = sns.lineplot(data=df_products_monthly_quantity_12m_t_classes_summary_m,
                  x='year_month',
                  y='quantity',
                  hue='abc_xyz_class',
                  marker='o',
                  linewidth=2.5,
                  markersize=7)

ax.set_title('Monthly Quantity per ABC-XYZ Class', fontsize=16)
ax.set_xlabel('Months', fontsize=12)
ax.set_ylabel('Quantity', fontsize=12)

ax.legend(title='ABC-XYZ Class', fontsize=10)
plt.xticks(rotation=45)
plt.show();
Observations
ABC classification summary (we’ve followed the revenue-based approach)
XYZ classification summary (sales stability)
ABC-XYZ analysis summary (revenue and sales stability)
Monthly quantity per ABC-XYZ Class
⚠ Note: we included new products in ABC-XYZ analysis, as they may represent a substantial part of the dataset. However, they may be underestimated and misclassified due to their short sales track, so we will flag them and study separately in the next steps.
Let’s review the share of returns.
⚠ Note: in this study, we consider only returns from mutually exclusive entries with negative quantities, as we’re focusing on product-related entries to identify products returned more often. The other negative quantity entries have been analyzed previously.
# checking the share of returns
returns = returns_excl.copy().sort_values(by='quantity')
share_evaluation(returns, df_ecom, show_boxplots=True, show_qty_rev=True, show_example=True, example_type='head', example_limit=5)
======================================================================================================================================================
Evaluation of share: returns
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 3139 (0.6% of all entries)
Quantity: -228936 (4.4% of the total quantity)
Revenue: -454347.9 (4.7% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
540422 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2019-12-07 09:27:00 2.08 16446 2019 12
61624 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 2019-01-16 10:17:00 1.04 12346 2019 1
160145 C550456 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR -3114 2019-04-16 13:08:00 2.10 15749 2019 4
160144 C550456 21175 GIN + TONIC DIET METAL SIGN -2000 2019-04-16 13:08:00 1.85 15749 2019 4
160143 C550456 85123A WHITE HANGING HEART T-LIGHT HOLDER -1930 2019-04-16 13:08:00 2.55 15749 2019 4
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
540422 2019-12 49 2019-Week-49 2019-12-07 5 Saturday -168469.60
61624 2019-01 3 2019-Week-03 2019-01-16 2 Wednesday -77183.60
160145 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday -6539.40
160144 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday -3700.00
160143 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday -4921.50
======================================================================================================================================================
Observations
Decisions
⚠ Note: mutually exclusive entries can represent either actual product returns or corrections of order placement errors. While distinguishing between these cases can be difficult or impossible, we’ve addressed the main outliers and excluded operational entries. Therefore, our approach to defining returns remains valid for this study.
# removing the top outliers and different kinds of non-product related operations from the returns DataFrame
operation = lambda df: df.query(
    'quantity > -20000 \
    and description not in @service_operations_descriptions \
    and stock_code not in @other_service_stock_codes \
    and description not in @delivery_related_operations_set')

returns_filtered = data_reduction(returns, operation)
Number of entries cleaned out from the "returns": 77 (2.5%)
# checking the share of filtered data on returns
share_evaluation(returns_filtered, df_ecom, show_boxplots=True, show_qty_rev=True, show_example=True, example_type='head', example_limit=3)
======================================================================================================================================================
Evaluation of share: returns_filtered
in df_ecom
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 3062 (0.6% of all entries)
Quantity: -73490 (1.4% of the total quantity)
Revenue: -149250.5 (1.5% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Top rows:
invoice_no stock_code description quantity invoice_date unit_price customer_id invoice_year invoice_month \
160145 C550456 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR -3114 2019-04-16 13:08:00 2.10 15749 2019 4
160144 C550456 21175 GIN + TONIC DIET METAL SIGN -2000 2019-04-16 13:08:00 1.85 15749 2019 4
160143 C550456 85123A WHITE HANGING HEART T-LIGHT HOLDER -1930 2019-04-16 13:08:00 2.55 15749 2019 4
invoice_year_month invoice_week invoice_year_week invoice_day invoice_day_of_week invoice_day_name revenue
160145 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday -6539.40
160144 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday -3700.00
160143 2019-04 16 2019-Week-16 2019-04-16 1 Tuesday -4921.50
======================================================================================================================================================
Observations
- The filtered returns that can be definitively matched to corresponding sales represent 1.4% of the total quantity and 1.5% of the total revenue.
- Although the impact of verifiable returns appears less significant than initially thought, we will proceed with the planned studies. This approach will help reveal insights on top returns and returns seasonality, and the established methodology may be useful for future recurring studies.
Let’s create a stock_code_description column (the joined key of stock code and description) for the returns data, so we can match returns with the general data on this parameter.
# creating the `stock_code_description` column
returns_filtered = returns_filtered.copy()  # avoiding SettingWithCopyWarning in the next step
returns_filtered['stock_code_description'] = returns_filtered['stock_code'] + "__" + returns_filtered['description']
# getting the summary on returns grouped by `stock_code_description`
returns_filtered_summary = (
    returns_filtered.groupby(['stock_code_description']).agg({'unit_price':'mean', 'quantity':'sum', 'revenue':'sum', 'stock_code_description':'count', 'invoice_no':'nunique'})
    .rename(columns={'invoice_no':'unique_invoices', 'stock_code_description':'entries', 'unit_price':'unit_price_mean'})
    .reset_index()
    .sort_values(by='quantity', ascending=True).round(1))

returns_filtered_summary.head()
stock_code_description | unit_price_mean | quantity | revenue | entries | unique_invoices | |
---|---|---|---|---|---|---|
96 | 21108__FAIRY CAKE FLANNEL ASSORTED C... | 1.70 | -3150 | -6591.40 | 3 | 3 |
1323 | 85123A__WHITE HANGING HEART T-LIGHT ... | 2.90 | -2524 | -6473.80 | 12 | 12 |
115 | 21175__GIN + TONIC DIET METAL SIGN | 2.30 | -2024 | -3761.20 | 3 | 3 |
773 | 22920__HERB MARKER BASIL | 0.60 | -1527 | -841.00 | 2 | 2 |
435 | 22273__FELTCRAFT DOLL MOLLY | 2.40 | -1440 | -3492.00 | 2 | 1 |
# getting the summary of the cleaned original DataFrame grouped by `stock_code_description`
df_ecom_filtered_summary = (
    df_ecom_filtered.groupby(['stock_code_description']).agg({'unit_price':'mean', 'quantity':'sum', 'revenue':'sum', 'stock_code_description':'count', 'invoice_no':'nunique'})
    .rename(columns={'invoice_no':'unique_invoices', 'stock_code_description':'entries', 'unit_price':'unit_price_mean'})
    .reset_index()
    .sort_values(by='quantity', ascending=True).round(1))

df_ecom_filtered_summary.sample(5, random_state=7)
stock_code_description | unit_price_mean | quantity | revenue | entries | unique_invoices | |
---|---|---|---|---|---|---|
1159 | 22259__FELT FARM ANIMAL HEN | 1.00 | 481 | 332.80 | 45 | 44 |
2152 | 23311__VINTAGE CHRISTMAS STOCKING | 3.00 | 2390 | 6488.20 | 347 | 344 |
113 | 18094C__WHITE AND BLUE CERAMIC OIL B... | 2.00 | 192 | 283.90 | 42 | 42 |
3671 | 90083__CRYSTAL CZECH CROSS PHONE CHARM | 1.50 | 25 | 23.50 | 9 | 9 |
3818 | 90183B__AMETHYST DROP EARRINGS W LON... | 2.90 | 21 | 61.10 | 17 | 17 |
In the next step we will join the summary of the original DataFrame with that of the returns. Then we will add the columns returns_rate and returns_loss_rate, where the Returns Rate is the share of return entries in the product’s total number of entries, and the Returns Loss Rate is the share of revenue lost to returns in the product’s total revenue.
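As a toy illustration of these two definitions (hypothetical numbers, not taken from the dataset):

# a hypothetical product with 200 sale entries, 4 of which are matched return entries,
# 5000.0 of total revenue and -120.0 of revenue coming from returns
entries, entries_returns = 200, 4
revenue, revenue_returns = 5000.0, -120.0

returns_rate = entries_returns / entries            # 0.02  -> 2% of the product's entries are returns
returns_loss_rate = abs(revenue_returns / revenue)  # 0.024 -> 2.4% of its revenue is lost to returns
print(returns_rate, returns_loss_rate)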
# merging the summaries of the original DataFrame and that of returns
df_ecom_filtered_with_returns_summary = df_ecom_filtered_summary.merge(returns_filtered_summary, on='stock_code_description', how='inner', suffixes=('', '_returns'))
df_ecom_filtered_with_returns_summary.sample(5, random_state=7)
stock_code_description | unit_price_mean | quantity | revenue | entries | unique_invoices | unit_price_mean_returns | quantity_returns | revenue_returns | entries_returns | unique_invoices_returns | |
---|---|---|---|---|---|---|---|---|---|---|---|
666 | 21875__KINGS CHOICE MUG | 1.80 | 2055 | 2429.40 | 149 | 148 | 1.20 | -24 | -30.00 | 1 | 1 |
794 | 23349__ROLL WRAP VINTAGE CHRISTMAS | 1.50 | 3221 | 4342.60 | 343 | 337 | 1.20 | -24 | -30.00 | 2 | 2 |
134 | 84952B__BLACK LOVE BIRD T-LIGHT HOLDER | 3.00 | 186 | 332.10 | 29 | 29 | 3.80 | -3 | -11.20 | 1 | 1 |
308 | 22181__SNOWSTORM PHOTO FRAME FRIDGE ... | 1.00 | 591 | 500.90 | 57 | 57 | 0.80 | -24 | -20.40 | 1 | 1 |
164 | 21363__HOME SMALL WOOD LETTERS | 6.50 | 243 | 1403.90 | 129 | 125 | 5.00 | -12 | -59.40 | 4 | 4 |
# adding columns describing overall return rate and loss rate of each product
df_ecom_filtered_with_returns_summary['returns_rate'] = df_ecom_filtered_with_returns_summary['entries_returns'] / df_ecom_filtered_with_returns_summary['entries']
df_ecom_filtered_with_returns_summary['returns_loss_rate'] = abs(df_ecom_filtered_with_returns_summary['revenue_returns'] / df_ecom_filtered_with_returns_summary['revenue'])

df_ecom_filtered_with_returns_summary.sample(3, random_state=10)
stock_code_description | unit_price_mean | quantity | revenue | entries | unique_invoices | unit_price_mean_returns | quantity_returns | revenue_returns | entries_returns | unique_invoices_returns | returns_rate | returns_loss_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
223 | 37500__TEA TIME TEAPOT IN GIFT BOX | 7.20 | 380 | 2360.70 | 113 | 113 | 7.40 | -2 | -14.90 | 2 | 2 | 0.02 | 0.01 |
177 | 84968A__SET OF 16 VINTAGE ROSE CUTLERY | 13.80 | 267 | 3139.30 | 99 | 98 | 12.80 | -8 | -102.00 | 2 | 2 | 0.02 | 0.03 |
339 | 85032C__CURIOUS IMAGES GIFT WRAP SET | 1.20 | 683 | 718.50 | 141 | 140 | 0.60 | -12 | -7.80 | 1 | 1 | 0.01 | 0.01 |
# checking descriptive statistics on returns
print('\033[1mDescriptive statistics on returns:\033[0m')
df_ecom_filtered_with_returns_summary[['returns_rate','returns_loss_rate']].describe().applymap(lambda x: f'{x:.3f}')
Descriptive statistics on returns:
returns_rate | returns_loss_rate | |
---|---|---|
count | 1051.000 | 1051.000 |
mean | 0.024 | 0.043 |
std | 0.066 | 0.101 |
min | 0.001 | 0.000 |
25% | 0.005 | 0.005 |
50% | 0.009 | 0.011 |
75% | 0.020 | 0.031 |
max | 1.000 | 1.000 |
Now let’s visualize the distributions of Returns Rate and Returns Loss Rate. We will use a combination of kernel density estimate (KDE) plots and scatter plots for a better overview of the data patterns and relationships.
# creating a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

# plotting KDE plots
for column, color in zip(['returns_rate', 'returns_loss_rate'], ['darksalmon', 'darkred']):
    sns.kdeplot(data=df_ecom_filtered_with_returns_summary[column] * 100, ax=ax1, linewidth=3, alpha=0.7, color=color, label=column.replace('_', ' ').title())

ax1.set_title('Distribution of Returns Rates and Returns Loss Rates', fontsize=16, fontweight='bold')
ax1.set_xlabel('Rate (%)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.grid(True, linestyle='--', alpha=0.7)
ax1.legend()

# plotting scatter plot
ax2.scatter(df_ecom_filtered_with_returns_summary['returns_rate'] * 100,
            df_ecom_filtered_with_returns_summary['returns_loss_rate'] * 100,
            color='darkred', alpha=0.6)

ax2.set_title('Returns Rate vs Returns Loss Rate', fontsize=16, fontweight='bold')
ax2.set_xlabel('Returns Rate (%)', fontsize=12)
ax2.set_ylabel('Returns Loss Rate (%)', fontsize=12)
ax2.grid(True, linestyle='--', alpha=0.7)

plt.figtext(0.1, -0.1, 'NOTE 1: Returns Rate represents the share of return entries, while Returns Loss Rate indicates the percentage of total revenue lost due to returns for corresponding products. \n\nNOTE 2: Return volume may be slightly higher due to returns that are processed outside our defined detection rules, such as same-product returns at different volumes or prices.', ha='left', fontsize=10, style='italic', wrap=True)

#plt.tight_layout()
plt.show();
Observations
- We have calculated returns_rate (describing the share of return entries) and returns_loss_rate (describing the share of total price of returns from the total revenue of corresponding products).
- Both rates are low on average: the mean is about 0.02 (2%) for returns_rate and 0.04 (4%) for returns_loss_rate.

In the next step, we will analyze products with the highest returned quantities and highest losses due to returns (negative revenue values). To focus on significant products, we will filter out those with low purchase frequency and minimal sales volume. Similarly to the Most Expensive Products study approach, we will exclude products whose total volume sold and total orders are below the 25th percentile of these metrics.
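For reference, here is a minimal sketch of how the two percentile thresholds used in the filter below could be derived; this is an assumption, since in the project these variables are defined earlier, presumably on the product-level summary.

import numpy as np

# assumed recomputation of the 25th-percentile thresholds referenced in the next cell
products_quantity_25_percentile = np.percentile(df_ecom_filtered_summary['quantity'], 25)
products_invoices_25_percentile = np.percentile(df_ecom_filtered_summary['unique_invoices'], 25)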
# filtering out unpopular products
df_ecom_filtered_with_returns_summary_popular = df_ecom_filtered_with_returns_summary.query('quantity >= @products_quantity_25_percentile and unique_invoices >= @products_invoices_25_percentile')

returned_products_popular = df_ecom_filtered_with_returns_summary_popular['stock_code_description'].tolist()
returns_filtered_popular = returns_filtered.query('stock_code_description in @returned_products_popular')
returns_filtered_popular.sample(5, random_state=7)
invoice_no | stock_code | description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | stock_code_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
526969 | C580686 | 22963 | JAM JAR WITH GREEN LID | -6 | 2019-12-03 15:28:00 | 0.85 | 15984 | 2019 | 12 | 2019-12 | 49 | 2019-Week-49 | 2019-12-03 | 1 | Tuesday | -5.10 | 22963__JAM JAR WITH GREEN LID |
111850 | C545837 | 22181 | SNOWSTORM PHOTO FRAME FRIDGE MAGNET | -24 | 2019-03-05 13:32:00 | 0.85 | 12598 | 2019 | 3 | 2019-03 | 10 | 2019-Week-10 | 2019-03-05 | 1 | Tuesday | -20.40 | 22181__SNOWSTORM PHOTO FRAME FRIDGE ... |
224503 | C556530 | 22501 | PICNIC BASKET WICKER LARGE | -3 | 2019-06-11 11:42:00 | 9.95 | 18109 | 2019 | 6 | 2019-06 | 24 | 2019-Week-24 | 2019-06-11 | 1 | Tuesday | -29.85 | 22501__PICNIC BASKET WICKER LARGE |
49849 | C540535 | 20914 | SET/5 RED RETROSPOT LID GLASS BOWLS | -2 | 2019-01-07 14:17:00 | 2.95 | 15005 | 2019 | 1 | 2019-01 | 2 | 2019-Week-02 | 2019-01-07 | 0 | Monday | -5.90 | 20914__SET/5 RED RETROSPOT LID GLASS... |
47483 | C540417 | 20719 | WOODLAND CHARLOTTE BAG | -30 | 2019-01-05 10:56:00 | 0.85 | 13680 | 2019 | 1 | 2019-01 | 1 | 2019-Week-01 | 2019-01-05 | 5 | Saturday | -25.50 | 20719__WOODLAND CHARLOTTE BAG |
# checking distribution and totals of quantity and revenue among top 20 products by returned quantity and loss due to returns (highest negative values of returns)
for parameter in ['quantity', 'revenue']:
    plot_totals_distribution(returns_filtered_popular, 'stock_code_description', parameter, sample_type='tail', sort_ascending=True, n_items=20, show_outliers=True, consistent_colors=True)
Also, let’s find out how many products appear both among those with the highest Returns Rates and those with the highest Returns Loss Rates. We will do that by comparing the 50 products with the highest values of each parameter.
top_50_returns_rate_products = set(
    df_ecom_filtered_with_returns_summary_popular.sort_values(by='returns_rate')
    ['stock_code_description'].tail(50))

top_50_returns_loss_rate_products = set(
    df_ecom_filtered_with_returns_summary_popular.sort_values(by='returns_loss_rate')
    ['stock_code_description'].tail(50))

common_products = top_50_returns_rate_products.intersection(top_50_returns_loss_rate_products)
number_of_common_products = len(common_products)
share_of_common_products = number_of_common_products / 50

print(f'\033[1mCommon products among top 50 by Returns Rate and top 50 by Returns Loss Rate:\033[0m {number_of_common_products} out of 50 ({share_of_common_products :0.1%})')
Common products among top 50 by Returns Rate and top 50 by Returns Loss Rate: 16 out of 50 (32.0%)
Observations
Two products stand out with the largest negative quantities: “FAIRY CAKE FLANNEL ASSORTED COLOUR” (-3.1k units) and “WHITE HANGING HEART T-LIGHT HOLDER” (-2.5k units), suggesting significant return volumes.
The distribution chart shows most products have relatively narrow return quantity ranges, with a few exceptions showing wider variability in return volumes. Interestingly, the “WHITE HANGING HEART T-LIGHT HOLDER” appears in both bottom charts (quantity and revenue), indicating this popular item also experiences substantial returns.
The top revenue loss comes from “FAIRY CAKE FLANNEL ASSORTED COLOUR” (-6k revenue) and “WHITE HANGING HEART T-LIGHT HOLDER” (-5.5k revenue), aligning with their high return quantities.
The distribution chart shows most products have narrow ranges of revenue loss as well.
💡 The negative revenue impact appears more concentrated than the quantity impact, with the top seven products representing significantly larger losses than the rest of the list.
💡 Our analysis reveals a significant overlap between high Returns Rates and high Returns Loss Rates. Specifically, 32% (16 out of 50) of the products appear in both the top-50 lists for highest Returns Rates and highest Returns Loss Rates. This observation points to a strong association between the frequency of returns and the financial impact of those returns for these stock codes.
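As an optional cross-check (not part of the original analysis), a rank correlation between the two rates gives a more direct measure of this association:

# Spearman rank correlation between returns_rate and returns_loss_rate for the popular products
print(df_ecom_filtered_with_returns_summary_popular[['returns_rate', 'returns_loss_rate']]
      .corr(method='spearman').round(2))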
As the overall period of our dataset includes some partial months, in the next step we will filter the returns data so that it covers only complete calendar months.
returns_filtered_12m = data_reduction(returns_filtered, lambda df: df.query('invoice_year_month >= "2018-12" and invoice_year_month < "2019-12"'))
Number of entries cleaned out from the "returns_filtered": 79 (2.6%)
Let’s create a DataFrame presenting a monthly summary of returns, which we will then use to calculate the monthly returns_rate and returns_loss_rate.
monthly_returns_summary = returns_filtered_12m.groupby('invoice_year_month').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'stock_code_description': ['count','nunique'],
    'invoice_no': 'nunique',
    'customer_id': 'nunique',
}).reset_index().sort_values('invoice_year_month')

monthly_returns_summary.columns = ['invoice_year_month', 'revenue', 'quantity', 'entries', 'unique_products', 'unique_invoices', 'unique_customers']
monthly_returns_summary.head(3)
invoice_year_month | revenue | quantity | entries | unique_products | unique_invoices | unique_customers | |
---|---|---|---|---|---|---|---|
0 | 2018-12 | -7593.15 | -2971 | 169 | 144 | 95 | 84 |
1 | 2019-01 | -7873.56 | -3356 | 212 | 186 | 95 | 78 |
2 | 2019-02 | -4395.85 | -1449 | 100 | 86 | 70 | 64 |
Let’s merge the summaries of the original DataFrame and that of returns.
# merging the summaries of the original DataFrame and the DataFrame of returns, where both are time-bounded
monthly_summary_with_returns = monthly_summary.merge(monthly_returns_summary, on='invoice_year_month', how='inner', suffixes=('', '_returns'))

# adding columns describing the overall return rate and loss rate of each month
monthly_summary_with_returns['returns_rate'] = monthly_summary_with_returns['entries_returns'] / monthly_summary_with_returns['entries']
monthly_summary_with_returns['returns_loss_rate'] = abs(monthly_summary_with_returns['revenue_returns'] / monthly_summary_with_returns['revenue'])

monthly_summary_with_returns.head(3)
invoice_year_month | revenue | quantity | unique_invoices | entries | unique_products | unique_customers | unit_price_mean | unit_price_median | revenue_change_pct | quantity_change_pct | unique_invoices_change_pct | unique_products_change_pct | unique_customers_change_pct | unit_price_mean_change_pct | revenue_absolute_change_pct | quantity_absolute_change_pct | unique_invoices_absolute_change_pct | unique_products_absolute_change_pct | unique_customers_absolute_change_pct | unit_price_mean_absolute_change_pct | invoice_year_month_float | revenue_returns | quantity_returns | entries_returns | unique_products_returns | unique_invoices_returns | unique_customers_returns | returns_rate | returns_loss_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-12 | 670676.20 | 299461 | 1282 | 35788 | 2736 | 769 | 3.86 | 2.55 | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2018.99 | -7593.15 | -2971 | 169 | 144 | 95 | 84 | 0.00 | 0.01 |
1 | 2019-01 | 641890.68 | 338021 | 1205 | 36781 | 2602 | 806 | 3.35 | 2.10 | -4.29 | 12.88 | -6.01 | -4.90 | 4.81 | -13.39 | -4.29 | 12.88 | -6.01 | -4.90 | 4.81 | -13.39 | 2019.08 | -7873.56 | -3356 | 212 | 186 | 95 | 78 | 0.01 | 0.01 |
2 | 2019-02 | 502201.30 | 277862 | 1071 | 26089 | 2396 | 745 | 3.56 | 2.46 | -21.76 | -17.80 | -11.12 | -7.92 | -7.57 | 6.53 | -25.12 | -7.21 | -16.46 | -12.43 | -3.12 | -7.74 | 2019.16 | -4395.85 | -1449 | 100 | 86 | 70 | 64 | 0.00 | 0.01 |
Let’s visualize our analysis by creating a combined graph of returns_rate and returns_loss_rate by month. We will use a Plotly scatter plot with the trend line option, thus benefiting from both Plotly’s interactivity and the possibility to detect trends in the metrics, if any.
# converting the `invoice_year_month` column to datetime
monthly_summary_with_returns['invoice_year_month'] = pd.to_datetime(monthly_summary_with_returns['invoice_year_month'], format='%Y-%m')

# creating a scatter plot with trend lines
fig = px.scatter(monthly_summary_with_returns,
                 x='invoice_year_month',
                 y=['returns_rate', 'returns_loss_rate'],
                 title='Returns Rate and Returns Loss Rate by Month',
                 trendline='lowess',  # here we use the Locally Weighted Scatterplot Smoothing, which follows the general data trend
                 trendline_options=dict(frac=0.7),
                 color_discrete_sequence=['darksalmon', 'darkred'],
                 size=[2.5]*len(monthly_summary_with_returns))  # setting marker sizes

# adjusting the appearance
fig.update_layout(
    xaxis_title='Year-Month',
    yaxis_title='Rate (%)',
    width=1200,
    height=600,
    title_x=0.5,
    title_y=.95,
    legend={'orientation': 'h', 'yanchor': 'bottom', 'y': 1.02, 'xanchor': 'right', 'x': 1},
    legend_title='')

# adding the note about trend lines
fig.add_annotation(
    xref='paper', x=0,
    yref='paper', y=-0.18,
    text='NOTE: the dashed lines represent general data trends for the Returns Rate and Returns Loss Rate (based on the Locally Weighted Scatterplot Smoothing).',
    showarrow=False,
    font=dict(size=11))

fig.update_traces(line=dict(dash='dash'))
fig.update_yaxes(tickformat='.1%')
fig.show();
The highest return loss rate month was April 2019, so let’s analyze the products that caused the most return-related losses that month.
# analyzing products that caused the most return-related losses in the highest Return Loss Rate month - April 2019
returns_2019_04 = returns_filtered_12m.query('invoice_year_month == "2019-04"')

plot_totals_distribution(returns_2019_04, 'stock_code_description', 'revenue', title_extension='in Returns of April 2019', n_items=10, sample_type='tail', show_outliers=False, sort_ascending=True)
Let’s add a float representation of invoice_year_month
. This will allow us to include months in our further correlation analysis of monthly-grouped parameters, thus helping detect influence of seasonality.
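For illustration, here is one simple way such a float representation could be built. This is a hedged sketch with a hypothetical month_as_float variable; the exact formula behind the invoice_year_month_float column shown in the summary above may differ.

# map e.g. 2019-04 to 2019 + (4 - 1) / 12 so that months can enter a numeric correlation matrix
month_as_float = (monthly_summary_with_returns['invoice_year_month'].dt.year
                  + (monthly_summary_with_returns['invoice_year_month'].dt.month - 1) / 12)
month_as_float.head(3)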
# building a correlation matrix and heatmap
corr_matrix_monthly_summary_with_returns = monthly_summary_with_returns[['invoice_year_month', 'revenue', 'quantity', 'unique_invoices', 'returns_rate', 'returns_loss_rate']].corr().round(2)

plt.figure(figsize=(10, 8))
plt.title('Correlation Heatmap of General and Returns Parameters Grouped by Month', fontsize=16)

# avoiding showing the duplicating data on the heatmap
hide_triangle_mask = np.triu(np.ones_like(corr_matrix_monthly_summary_with_returns))

# plotting a heatmap and rotating the names on axis
heatmap = sns.heatmap(corr_matrix_monthly_summary_with_returns, mask=hide_triangle_mask, annot=True, cmap='RdYlGn', vmin=-1, vmax=1, linewidths=0.7)
plt.setp(heatmap.get_xticklabels(), rotation=45, ha='right')
plt.setp(heatmap.get_yticklabels(), rotation=0, ha='right');
Observations
At this stage we will complement our ABC-XYZ analysis with data indicating products’ return levels, so they can be addressed accordingly. For example, a product in the top-performing AX class but with poor return scores would need extra attention (such as a root-cause analysis of its high returns) prior to promotional activities.

We will develop and apply a rate_classification function to define returns_rate and returns_loss_rate levels, thus highlighting products worth attention.
def rate_classification(rate, percentile_25, percentile_50, percentile_75):
"""
This function classifies a rate into categories based on provided percentile thresholds.
Inputs:
- rate (float): The rate to be classified (e.g., Return rate or Return Loss Rate).
- percentile_25 (float): The 25th percentile threshold.
- percentile_50 (float): The 50th percentile threshold.
- percentile_75 (float): The 75th percentile threshold.
Output:
str: A class label indicating the level of the rate:
- 'low' for rates at or below the 25th percentile
- 'moderate' for rates between the 25th and 50th percentile
- 'high' for rates between the 50th and 75th percentile
- 'very high' for rates above the 75th percentile
"""
if rate <= percentile_25:
return 'low'
elif rate <= percentile_50:
return 'moderate'
elif rate <= percentile_75:
return 'high'
else:
return 'very high'
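A quick sanity check of the classifier, using threshold values close to the returns_rate percentiles reported a little further below (roughly 0.5%, 0.9%, and 2.0%):

# a 1.5% returns rate falls between the 50th and 75th percentile thresholds -> 'high'
print(rate_classification(0.015, 0.005, 0.009, 0.020))  # high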
Let’s apply the rate_classification()
function above and assign appropriate classes of returns.
# calculating percentiles for `returns_rate`
returns_rate_25_percentile = np.percentile(df_ecom_filtered_with_returns_summary['returns_rate'], 25)
returns_rate_50_percentile = np.percentile(df_ecom_filtered_with_returns_summary['returns_rate'], 50)
returns_rate_75_percentile = np.percentile(df_ecom_filtered_with_returns_summary['returns_rate'], 75)

# applying classification for `returns_rate`
df_ecom_filtered_with_returns_summary['returns_rate_class'] = df_ecom_filtered_with_returns_summary['returns_rate'].apply(
    lambda x: rate_classification(x, returns_rate_25_percentile, returns_rate_50_percentile, returns_rate_75_percentile))

# calculating percentiles for `returns_loss_rate`
returns_loss_rate_25_percentile = np.percentile(df_ecom_filtered_with_returns_summary['returns_loss_rate'], 25)
returns_loss_rate_50_percentile = np.percentile(df_ecom_filtered_with_returns_summary['returns_loss_rate'], 50)
returns_loss_rate_75_percentile = np.percentile(df_ecom_filtered_with_returns_summary['returns_loss_rate'], 75)

# printing out the summary on the rates classification
print('\033[1mReturn rate Classification:\033[0m')
print(f'Low: <= {returns_rate_25_percentile:.1%}')
print(f'Moderate: > {returns_rate_25_percentile:.1%} but <= {returns_rate_50_percentile:.1%}')
print(f'High: > {returns_rate_50_percentile:.1%} but <= {returns_rate_75_percentile:.1%}')
print(f'Very High: > {returns_rate_75_percentile:.1%}')

print('\n\033[1mReturn Loss Rate Classification:\033[0m')
print(f'Low: <= {returns_loss_rate_25_percentile:.1%}')
print(f'Moderate: > {returns_loss_rate_25_percentile:.1%} but <= {returns_loss_rate_50_percentile:.1%}')
print(f'High: > {returns_loss_rate_50_percentile:.1%} but <= {returns_loss_rate_75_percentile:.1%}')
print(f'Very High: > {returns_loss_rate_75_percentile:.1%}')

# applying classification for `returns_loss_rate`
df_ecom_filtered_with_returns_summary['returns_loss_rate_class'] = df_ecom_filtered_with_returns_summary['returns_loss_rate'].apply(
    lambda x: rate_classification(x, returns_loss_rate_25_percentile, returns_loss_rate_50_percentile, returns_loss_rate_75_percentile))

# checking the result
df_ecom_filtered_with_returns_summary.sample(3, random_state=7)
Return rate Classification:
Low: <= 0.5%
Moderate: > 0.5% but <= 0.9%
High: > 0.9% but <= 2.0%
Very High: > 2.0%
Return Loss Rate Classification:
Low: <= 0.5%
Moderate: > 0.5% but <= 1.1%
High: > 1.1% but <= 3.1%
Very High: > 3.1%
stock_code_description | unit_price_mean | quantity | revenue | entries | unique_invoices | unit_price_mean_returns | quantity_returns | revenue_returns | entries_returns | unique_invoices_returns | returns_rate | returns_loss_rate | returns_rate_class | returns_loss_rate_class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
666 | 21875__KINGS CHOICE MUG | 1.80 | 2055 | 2429.40 | 149 | 148 | 1.20 | -24 | -30.00 | 1 | 1 | 0.01 | 0.01 | moderate | high |
794 | 23349__ROLL WRAP VINTAGE CHRISTMAS | 1.50 | 3221 | 4342.60 | 343 | 337 | 1.20 | -24 | -30.00 | 2 | 2 | 0.01 | 0.01 | moderate | moderate |
134 | 84952B__BLACK LOVE BIRD T-LIGHT HOLDER | 3.00 | 186 | 332.10 | 29 | 29 | 3.80 | -3 | -11.20 | 1 | 1 | 0.03 | 0.03 | very high | very high |
Let’s create a function to assign a combined return score. We simplify the return analysis by combining returns_rate and returns_loss_rate into a single score, while the two rates can still be checked separately if necessary.
def combined_return_score(rate_class, loss_class):
    scores = {'low': 1, 'moderate': 2, 'high': 3, 'very high': 4}
    return scores[rate_class] + scores[loss_class]

# applying the function to create a new column
df_ecom_filtered_with_returns_summary['return_score'] = df_ecom_filtered_with_returns_summary.apply(
    lambda x: combined_return_score(x['returns_rate_class'], x['returns_loss_rate_class']), axis=1)
df_ecom_filtered_with_returns_summary.sample(3, random_state=7)
stock_code_description | unit_price_mean | quantity | revenue | entries | unique_invoices | unit_price_mean_returns | quantity_returns | revenue_returns | entries_returns | unique_invoices_returns | returns_rate | returns_loss_rate | returns_rate_class | returns_loss_rate_class | return_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
666 | 21875__KINGS CHOICE MUG | 1.80 | 2055 | 2429.40 | 149 | 148 | 1.20 | -24 | -30.00 | 1 | 1 | 0.01 | 0.01 | moderate | high | 5 |
794 | 23349__ROLL WRAP VINTAGE CHRISTMAS | 1.50 | 3221 | 4342.60 | 343 | 337 | 1.20 | -24 | -30.00 | 2 | 2 | 0.01 | 0.01 | moderate | moderate | 4 |
134 | 84952B__BLACK LOVE BIRD T-LIGHT HOLDER | 3.00 | 186 | 332.10 | 29 | 29 | 3.80 | -3 | -11.20 | 1 | 1 | 0.03 | 0.03 | very high | very high | 8 |
Now let’s create a function to categorize the return score.
def categorize_return_score(score):
if score <= 2:
return 'R1' # low returns
elif score <= 4:
return 'R2' # moderate returns
elif score <= 6:
return 'R3' # high returns
else:
return 'R4' # very high returns
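As a quick check of how the two helpers compose, take the sampled KINGS CHOICE MUG above, which combines a 'moderate' rate class with a 'high' loss class:

# 'moderate' (2) + 'high' (3) = 5, and a score of 5 falls into the R3 (high returns) bucket
print(categorize_return_score(combined_return_score('moderate', 'high')))  # R3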
# applying the function to create a new column
df_ecom_filtered_with_returns_summary['return_class'] = df_ecom_filtered_with_returns_summary['return_score'].apply(categorize_return_score)
df_ecom_filtered_with_returns_summary.sample(3, random_state=7)
stock_code_description | unit_price_mean | quantity | revenue | entries | unique_invoices | unit_price_mean_returns | quantity_returns | revenue_returns | entries_returns | unique_invoices_returns | returns_rate | returns_loss_rate | returns_rate_class | returns_loss_rate_class | return_score | return_class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
666 | 21875__KINGS CHOICE MUG | 1.80 | 2055 | 2429.40 | 149 | 148 | 1.20 | -24 | -30.00 | 1 | 1 | 0.01 | 0.01 | moderate | high | 5 | R3 |
794 | 23349__ROLL WRAP VINTAGE CHRISTMAS | 1.50 | 3221 | 4342.60 | 343 | 337 | 1.20 | -24 | -30.00 | 2 | 2 | 0.01 | 0.01 | moderate | moderate | 4 | R2 |
134 | 84952B__BLACK LOVE BIRD T-LIGHT HOLDER | 3.00 | 186 | 332.10 | 29 | 29 | 3.80 | -3 | -11.20 | 1 | 1 | 0.03 | 0.03 | very high | very high | 8 | R4 |
Now let’s combine ABC-XYZ class with the return class.
# merging DataFrames with ABC-XYZ analyses and returns
df_abc_xyz_returns = df_abc_xyz.merge(df_ecom_filtered_with_returns_summary[['stock_code_description', 'returns_rate_class', 'returns_loss_rate_class', 'return_class']], on='stock_code_description', how='left').fillna('R0')  # assigning R0 return score for cases without returns
df_abc_xyz_returns.sample(3, random_state=7)
stock_code_description | abc_class | revenue | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | abc_xyz_class | returns_rate_class | returns_loss_rate_class | return_class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1844 | 21707__FOLDING UMBRELLA BLACKBLUE PO... | C | 743.86 | 10.91 | 156.00 | 13.00 | 0.84 | Y | CY | R0 | R0 | R0 |
3437 | 90059E__DIAMANTE HAIR GRIP PACK/2 RUBY | C | 31.47 | 2.23 | 19.00 | 1.58 | 1.41 | Z | CZ | very high | very high | R4 |
836 | 23212__HEART WREATH DECORATION WITH ... | A | 2655.48 | 225.87 | 2152.00 | 179.33 | 1.26 | Z | AZ | high | high | R3 |
Let’s check counts of return_class
values and then visualize them by plotting a pie-chart.
# adding `returns_explanation` column
return_class_counts = df_abc_xyz_returns['return_class'].value_counts().reset_index()
return_class_counts.columns = ['return_class', 'count']
return_class_counts['returns_explanation'] = return_class_counts['return_class'].apply(
    lambda x: 'No Returns detected' if x == 'R0' else
              'Low returns (score <= 2)' if x == 'R1' else
              'Moderate returns (2 < score <= 4)' if x == 'R2' else
              'High returns (4 < score <= 6)' if x == 'R3' else
              'Very high returns (score > 6)')

return_class_counts
return_class | count | returns_explanation | |
---|---|---|---|
0 | R0 | 2859 | No Returns detected |
1 | R3 | 304 | High returns (4 < score <= 6) |
2 | R4 | 296 | Very high returns (score > 6) |
3 | R2 | 291 | Moderate returns (2 < score <= 4) |
4 | R1 | 160 | Low returns (score <= 2) |
# creating a pie chart of return classes distribution
fig, ax = plt.subplots(figsize=(7, 7))
colors = sns.color_palette('pastel')

ax.pie(return_class_counts['count'],
       labels=return_class_counts['return_class'] + ' - ' + return_class_counts['returns_explanation'],
       autopct='%1.1f%%',
       startangle=90,
       colors=colors)

ax.set_title('Distribution of Return Classes', fontsize=16)

#plt.tight_layout()
plt.show();
Now let’s create the abc_xyz_return_class
column combining ABC-XYZ and returns analyses.
df_abc_xyz_returns['abc_xyz_return_class'] = df_abc_xyz_returns['abc_xyz_class'] + '_' + df_abc_xyz_returns['return_class']
df_abc_xyz_returns.sample(3, random_state=7)
stock_code_description | abc_class | revenue | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | abc_xyz_class | returns_rate_class | returns_loss_rate_class | return_class | abc_xyz_return_class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1844 | 21707__FOLDING UMBRELLA BLACKBLUE PO... | C | 743.86 | 10.91 | 156.00 | 13.00 | 0.84 | Y | CY | R0 | R0 | R0 | CY_R0 |
3437 | 90059E__DIAMANTE HAIR GRIP PACK/2 RUBY | C | 31.47 | 2.23 | 19.00 | 1.58 | 1.41 | Z | CZ | very high | very high | R4 | CZ_R4 |
836 | 23212__HEART WREATH DECORATION WITH ... | A | 2655.48 | 225.87 | 2152.00 | 179.33 | 1.26 | Z | AZ | high | high | R3 | AZ_R3 |
# creating a DataFrame summarizing data on `abc_xyz_return_class`
df_abc_xyz_returns_summary = df_abc_xyz_returns.groupby('abc_xyz_return_class').agg(
    unique_products=('stock_code_description', 'nunique'),
    quantity=('quantity', 'sum'),
    avg_quantity=('avg_quantity', 'mean'),
    revenue=('revenue', 'sum'),
    cov_quantity=('cov_quantity', 'mean'),
).reset_index()

df_abc_xyz_returns_summary.sort_values(by='revenue', ascending=False).sample(5, random_state=7)
abc_xyz_return_class | unique_products | quantity | avg_quantity | revenue | cov_quantity | |
---|---|---|---|---|---|---|
1 | AX_R1 | 42 | 326237.00 | 647.30 | 426030.68 | 0.37 |
9 | AY_R4 | 24 | 35205.00 | 122.24 | 173130.40 | 0.72 |
36 | CY_R4 | 50 | 28581.00 | 47.63 | 30011.55 | 0.79 |
16 | BX_R1 | 3 | 12293.00 | 341.47 | 6468.67 | 0.46 |
25 | BZ_R0 | 163 | 222166.00 | 113.58 | 298684.55 | 1.66 |
Let’s recall that we defined new products as those having sales within the last three months, but none before.
We will extract the last 3 months and then create a column flagging new products according to this definition.
# extracting necessary months
last_3_months = year_month_columns_12m[-3:]
all_except_last_3_months = year_month_columns_12m[:-3]

display(last_3_months)
display(all_except_last_3_months)
['2019-09', '2019-10', '2019-11']
['2018-12',
'2019-01',
'2019-02',
'2019-03',
'2019-04',
'2019-05',
'2019-06',
'2019-07',
'2019-08']
# creating a column, indicating whether the product is treated as a new one
df_products_monthly_quantity_12m_t['new_product'] = (
    (df_products_monthly_quantity_12m_t[last_3_months] > 0).any(axis=1) &              # sales in any of the last 3 months and
    (df_products_monthly_quantity_12m_t[all_except_last_3_months] == 0).all(axis=1))   # no sales within earlier months

df_products_monthly_quantity_12m_t.head(3)
invoice_year_month | stock_code_description | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 | 2019-10 | 2019-11 | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | new_product |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002__INFLATABLE POLITICAL GLOBE | 190.00 | 340.00 | 54.00 | 146.00 | 69.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 107.66 | 799.00 | 66.58 | 1.62 | Z | False |
1 | 10080__GROOVY CACTUS INFLATABLE | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 60.00 | 60.00 | 24.00 | 60.00 | 30.00 | 67.00 | 28.79 | 303.00 | 25.25 | 1.14 | Z | False |
2 | 10120__DOGGY RUBBER | 16.00 | 0.00 | 30.00 | 28.00 | 0.00 | 3.00 | 0.00 | 10.00 | 30.00 | 10.00 | 11.00 | 48.00 | 15.35 | 186.00 | 15.50 | 0.99 | Y | False |
# checking the share of new products
df_products_monthly_quantity_12m_t['new_product'].mean()
0.07340153452685422
# enriching `df_abc_xyz` DataFrame with the column, indicating new products
df_abc_xyz_new_products = df_abc_xyz.copy().merge(df_products_monthly_quantity_12m_t[['stock_code_description','new_product']], on='stock_code_description', how='left')
df_abc_xyz_new_products.sample(3, random_state=3)
stock_code_description | abc_class | revenue | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | abc_xyz_class | new_product | |
---|---|---|---|---|---|---|---|---|---|---|
1638 | 22307__GOLD MUG BONE CHINA TREE OF LIFE | C | 956.05 | 102.06 | 764.00 | 63.67 | 1.60 | Z | CZ | False |
549 | 20974__12 PENCILS SMALL TUBE SKULL | A | 4431.47 | 286.96 | 6840.00 | 570.00 | 0.50 | Y | AY | False |
454 | 23526__WALL ART DOG LICENCE | A | 5241.39 | 171.52 | 855.00 | 71.25 | 2.41 | Z | AZ | True |
Now let’s create the abc_xyz_products column, combining the ABC-XYZ and new-products analyses.
df_abc_xyz_new_products['abc_xyz_products'] = df_abc_xyz_new_products.apply(
    lambda x: x['abc_xyz_class'] + '_New Product' if x['new_product'] else x['abc_xyz_class'] + '_Old Product',
    axis=1)

df_abc_xyz_new_products.sample(3, random_state=3)
stock_code_description | abc_class | revenue | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | abc_xyz_class | new_product | abc_xyz_products | |
---|---|---|---|---|---|---|---|---|---|---|---|
1638 | 22307__GOLD MUG BONE CHINA TREE OF LIFE | C | 956.05 | 102.06 | 764.00 | 63.67 | 1.60 | Z | CZ | False | CZ_Old Product |
549 | 20974__12 PENCILS SMALL TUBE SKULL | A | 4431.47 | 286.96 | 6840.00 | 570.00 | 0.50 | Y | AY | False | AY_Old Product |
454 | 23526__WALL ART DOG LICENCE | A | 5241.39 | 171.52 | 855.00 | 71.25 | 2.41 | Z | AZ | True | AZ_New Product |
# evaluating new products
total_products_number = df_abc_xyz_new_products['new_product'].count()
old_products_number = len(df_abc_xyz_new_products.query('new_product == False'))
new_products_number = df_abc_xyz_new_products['new_product'].sum()
new_products_share = df_abc_xyz_new_products['new_product'].mean()

display(Markdown(f'**Summary on products:**'))
print(f'\033[1mAll products:\033[0m {total_products_number}')
print(f'\033[1mEstablished products:\033[0m {old_products_number} ({(1-new_products_share) * 100 :0.1f}%)')
print(f'\033[1mNew products:\033[0m {new_products_number} ({new_products_share * 100 :0.1f}%)')
Summary on products:
All products: 3910
Established products: 3623 (92.7%)
New products: 287 (7.3%)
# creating a DataFrame with summary on new products only
df_abc_xyz_new_products_only = df_abc_xyz_new_products.copy().query('new_product == True')
df_abc_xyz_new_products_only
stock_code_description | abc_class | revenue | std_quantity | quantity | avg_quantity | cov_quantity | xyz_class | abc_xyz_class | new_product | abc_xyz_products | |
---|---|---|---|---|---|---|---|---|---|---|---|
196 | 23581__JUMBO BAG PAISLEY PARK | A | 10732.64 | 994.24 | 4607.00 | 383.92 | 2.59 | Z | AZ | True | AZ_New Product |
236 | 23582__VINTAGE DOILY JUMBO BAG RED | A | 9255.36 | 1045.14 | 4302.00 | 358.50 | 2.92 | Z | AZ | True | AZ_New Product |
275 | 23534__WALL ART STOP FOR TEA | A | 8024.07 | 260.89 | 1323.00 | 110.25 | 2.37 | Z | AZ | True | AZ_New Product |
278 | 23493__VINTAGE DOILY TRAVEL SEWING KIT | A | 7921.17 | 666.86 | 3695.00 | 307.92 | 2.17 | Z | AZ | True | AZ_New Product |
323 | 23535__WALL ART BICYCLE SAFETY | A | 7039.68 | 214.75 | 1101.00 | 91.75 | 2.34 | Z | AZ | True | AZ_New Product |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3875 | 85049c__ROMANTIC PINKS RIBBONS | C | 2.46 | 0.29 | 1.00 | 0.08 | 3.46 | Z | CZ | True | CZ_New Product |
3892 | 23664__FLOWER SHOP DESIGN MUG | C | 1.65 | 0.29 | 1.00 | 0.08 | 3.46 | Z | CZ | True | CZ_New Product |
3893 | 84550__CROCHET LILAC/RED BEAR KEYRING | C | 1.65 | 0.29 | 1.00 | 0.08 | 3.46 | Z | CZ | True | CZ_New Product |
3904 | 84206B__CAT WITH SUNGLASSES BLANK CARD | C | 0.95 | 1.44 | 5.00 | 0.42 | 3.46 | Z | CZ | True | CZ_New Product |
3907 | 51014c__FEATHER PEN,COAL BLACK | C | 0.83 | 0.29 | 1.00 | 0.08 | 3.46 | Z | CZ | True | CZ_New Product |
287 rows × 11 columns
# determining a list of new products
new_products_list_12m = df_abc_xyz_new_products_only['stock_code_description'].to_list()
new_products_list_12m[:3] # sample of new products
['23581__JUMBO BAG PAISLEY PARK',
'23582__VINTAGE DOILY JUMBO BAG RED',
'23534__WALL ART STOP FOR TEA']
# extracting entries of new products
df_ecom_filtered_12m_new_products_only = df_ecom_filtered_12m.copy().query('stock_code_description in @new_products_list_12m')
# checking the volume of new products' entries
share_evaluation(df_ecom_filtered_12m_new_products_only, df_ecom_filtered_12m,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_period=True)
======================================================================================================================================================
Evaluation of share: df_ecom_filtered_12m_new_products_only
in df_ecom_filtered_12m
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 16125 (3.2% of all entries)
Quantity: 132086 (2.6% of the total quantity)
Revenue: 334395.6 (3.5% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
[Pie charts: Quantity Share, Revenue Share, Entries Share, Invoices Coverage, Products Coverage, and Customers Coverage of df_ecom_filtered_12m_new_products_only within df_ecom_filtered_12m. Entries are counted separately even when they belong to the same order; an invoice, product, or customer counts as one full unique item in the coverage charts if at least one of its entries falls into df_ecom_filtered_12m_new_products_only.]
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Invoice period coverage: 2019-09-02 - 2019-11-30 (24.5%; 89 out of 364 total days; 3 out of 12 total months)
======================================================================================================================================================
Let’s also check the impact new products had in the last 3 months only (above we studied the share and impact of new products across the whole 12-month dataset; here we study only the period in which the new products appeared according to our definition).
# defining the last 3 month DataFrame
df_ecom_filtered_3m = df_ecom_filtered_12m.copy().query('invoice_year_month in @last_3_months')
# checking the volume of new products' entries
share_evaluation(df_ecom_filtered_12m_new_products_only, df_ecom_filtered_3m,
                 show_qty_rev=True,
                 show_pie_charts=True,
                 pie_chart_parameters={
                     ('quantity', 'sum'): 'Quantity Share',
                     ('revenue', 'sum'): 'Revenue Share',
                     ('invoice_no', 'count'): 'Entries Share',
                     ('invoice_no', 'nunique'): 'Invoices Coverage',
                     ('stock_code_description', 'nunique'): 'Products Coverage',
                     ('customer_id', 'nunique'): 'Customers Coverage'},
                 show_pie_charts_notes=True,
                 show_boxplots=True,
                 show_period=True)
======================================================================================================================================================
Evaluation of share: df_ecom_filtered_12m_new_products_only
in df_ecom_filtered_3m
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 16125 (8.5% of all entries)
Quantity: 132086 (6.8% of the total quantity)
Revenue: 334395.6 (9.3% of the total revenue)
------------------------------------------------------------------------------------------------------------------------------------------------------
[Pie charts: Quantity Share, Revenue Share, Entries Share, Invoices Coverage, Products Coverage, and Customers Coverage of df_ecom_filtered_12m_new_products_only within df_ecom_filtered_3m. Entries are counted separately even when they belong to the same order; an invoice, product, or customer counts as one full unique item in the coverage charts if at least one of its entries falls into df_ecom_filtered_12m_new_products_only.]
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Invoice period coverage: 2019-09-02 - 2019-11-30 (100.0%; 89 out of 89 total days; 3 out of 3 total months)
======================================================================================================================================================
Observations
From the boxplots above, we can see that some entries stand out in terms of quantity and revenue. Let’s identify whether there are new products that significantly outperform the others. We will use our plot_totals_distribution function for this purpose.
# checking distribution and totals of quantity and revenue among top 20 new products by quantity and revenue
for parameter in ['quantity', 'revenue']:
    plot_totals_distribution(df_ecom_filtered_12m_new_products_only, 'stock_code_description', parameter,
                             title_extension='among New Products', n_items=20, show_outliers=False)
Let’s check how many new products that are leaders in sales volume are also leaders in revenue. We will compare the two lists of the top 20 products in each parameter.
top_20_new_products_quantity = set(
    df_abc_xyz_new_products_only.sort_values(by='quantity')
    ['stock_code_description'].tail(20))

top_20_new_products_revenue = set(
    df_abc_xyz_new_products_only.sort_values(by='revenue')
    ['stock_code_description'].tail(20))

common_products = top_20_new_products_quantity.intersection(top_20_new_products_revenue)
number_of_common_products = len(common_products)
share_of_common_products = number_of_common_products / 20

print(f'\033[1mCommon products among top 20 new products by quantity and revenue:\033[0m {number_of_common_products} out of 20 ({share_of_common_products :0.1%})')
Common products among top 20 new products by quantity and revenue: 6 out of 20 (30.0%)
Observations
At this stage, we will complement our ABC-XYZ analysis with data on new products so they can be addressed accordingly. For instance, products in the AZ and BZ groups of new products should not be downgraded due to their high volatility, as they are still new and have not yet had the chance to realize their full potential.
# creating the DataFrame summarizing data on `abc_xyz_new_products`
df_abc_xyz_new_products_summary = df_abc_xyz_new_products.groupby('abc_xyz_products').agg(
    unique_products=('stock_code_description', 'nunique'),
    quantity=('quantity', 'sum'),
    avg_quantity=('avg_quantity', 'mean'),
    revenue=('revenue', 'sum'),
    cov_quantity=('cov_quantity', 'mean'),
).reset_index()
df_abc_xyz_new_products_summary.sort_values(by='revenue', ascending=False)
abc_xyz_products | unique_products | quantity | avg_quantity | revenue | cov_quantity | |
---|---|---|---|---|---|---|
1 | AY_Old Product | 342 | 1430568.00 | 348.58 | 3212072.15 | 0.71 |
0 | AX_Old Product | 199 | 1255673.00 | 525.83 | 2277287.47 | 0.39 |
3 | AZ_Old Product | 262 | 764899.00 | 243.29 | 1932539.73 | 1.45 |
11 | CZ_Old Product | 1763 | 514756.00 | 24.33 | 528140.52 | 1.91 |
7 | BZ_Old Product | 218 | 297266.00 | 113.63 | 404431.48 | 1.47 |
5 | BY_Old Product | 191 | 290058.00 | 126.55 | 359947.21 | 0.73 |
9 | CY_Old Product | 529 | 308387.00 | 48.58 | 305357.82 | 0.78 |
2 | AZ_New Product | 39 | 49440.00 | 105.64 | 190056.19 | 2.44 |
4 | BX_Old Product | 62 | 120241.00 | 161.61 | 117392.21 | 0.42 |
10 | CZ_New Product | 209 | 46172.00 | 18.41 | 72815.40 | 2.76 |
6 | BZ_New Product | 39 | 36474.00 | 77.94 | 71524.05 | 2.30 |
8 | CX_Old Product | 57 | 58080.00 | 84.91 | 46195.22 | 0.43 |
# plotting a bubble chart for ABC-XYZ & New Products analysis
fig = px.scatter(
    df_abc_xyz_new_products_summary,
    x='revenue',
    y='quantity',
    size='revenue',
    color='revenue',
    color_continuous_scale='RdYlGn',
    hover_name='abc_xyz_products',
    text='abc_xyz_products',
    title='ABC-XYZ & New Products Analysis: Bubble Chart of Quantity vs. Revenue')

fig.update_layout(
    height=650,
    width=650,
    title_x=0.5,
    title_y=0.9)
fig.update_traces(textposition='middle left')
fig.show()
In this part of our study we will test several hypotheses, aiming to gain insights valuable for further business decisions.
The hypotheses to test are the following:
Impact of Price on A-Class Product Sales Hypothesis
Reasoning: Revenue is generated by both the price and the quantity of products sold. This test aims to reveal whether higher-priced (price above the median) or lower-priced (price below the median) A-class products sell better, so that we can decide which of them to focus our marketing and inventory efforts on.
⚠ Note: Here we consider A-class products according to the ABC matrix, i.e., those bringing 80% of the total revenue. In the current tests we decided to focus on A-class products only, as they generate the major share of revenue while representing only about 20% of all products. If we ran the tests on the whole set of products, less valuable products might affect the study, potentially decreasing its significance and practical value.
New vs. Established Products: Average Daily Sales Hypothesis
Reasoning: During the Time-based Analysis and Correlation Analysis stages, we revealed that the number of unique products is highly correlated with the total quantity sold. This test can help us evaluate the success of new products and complement our study of the effect of launching new products on sales volume. If new products are sold significantly better than established products, it might support more frequent product launches and greater investment in their marketing. Conversely, if established products are selling better, it could suggest focusing on improving inventory and marketing for existing products.
We will use “average quantity sold per product” as the key metric for this study, as it is not influenced by pricing differences, which could skew the results if we compared a revenue-based metric.
As we already know, sales vary significantly over time. With this in mind, we will base our testing of the current hypothesis on the same time slot: the last full three months for both new and established products.
Note 1: By “new products” we mean all entries from products introduced in the last three months. By “established products,” we mean products introduced before the last three months, but we only take into account their entries from the last three months.
Note 2: We must consider that both sales volume and pricing of new products may be heavily affected by marketing campaigns run alongside the introduction of those products. Currently, we lack data to verify such influence. The last three months might also be affected by seasonal trends that could impact new and established products differently. Keeping this in mind, we aim to define major patterns in this test. If we don’t observe them, we cannot be confident in our assumptions unless we examine marketing policies, campaigns, and their major sales effects (e.g., changes in pricing).
To determine the appropriate statistical test, we need to check the normality of our data distributions. Given our large dataset, we will focus on visual inspection of the distribution shape and examination of skewness, rather than relying on the Shapiro-Wilk test, which is known for poor p-value accuracy on large sample sizes (N > 5000).
Our distribution_IQR function will be handy once again for this purpose, as it provides both histograms and boxplots for visual inspection of symmetry and tails, as well as calculation and interpretation of skewness.
Based on the results of this examination, we can choose an appropriate statistical test type.
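For illustration, here is a minimal, self-contained sketch of such a skewness check on synthetic right-skewed data (it is not our distribution_IQR function, and the thresholds are common rule-of-thumb values rather than project-specific ones):

# a minimal, self-contained sketch of a skewness check (illustrative only)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=2.0, sigma=0.8, size=10_000)  # synthetic right-skewed data

skewness = stats.skew(sample)
print(f'Skewness: {skewness:.2f}')

# rule-of-thumb interpretation of the skewness value
if abs(skewness) < 0.5:
    print('The distribution is approximately symmetric.')
elif abs(skewness) < 1:
    print('The distribution is moderately skewed.')
else:
    print('The distribution is highly skewed.')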
For testing our hypotheses, we will use a function called testing_averages
. This function conducts statistical tests to compare two samples, determines the appropriate test based on data normality, calculates descriptive statistics, and optionally creates a histogram for visual comparison (it’s a development from the previous projects, which we slightly modified for the current tasks).
The function’s normality check is based on the Shapiro-Wilk test. As mentioned above, it is not very reliable on large samples, so we will double-check the normality assumptions with our visual inspection of the distribution shape and examination of skewness.
The testing_averages function plots two histograms on the same figure. Since the sample sizes we compare may differ significantly, the number of bins in each histogram must be adjusted accordingly for better visual comparison. We determine the optimal number of bins automatically using the Freedman-Diaconis rule, implemented in the bins_calculation function.
For consistency with our ABC-XYZ analysis, which considered only entire months, we will use the same 12-month period for our hypothesis testing.
Function: bins_calculation
def bins_calculation(data, min_bins=10, max_bins=5000):
    """
    This function calculates the optimal number of bins for a histogram using the Freedman-Diaconis rule, where bin width is based on IQR of the data.
    The minimum and maximum number of bins can be specified. By default: min_bins=10, max_bins=5000.
    """
    # removing NaN values, if any
    data = data.dropna()

    # calculating the interquartile range (IQR)
    q75, q25 = np.percentile(data, [75, 25])
    iqr = q75 - q25

    # calculating bin width and number
    bin_width = 2 * iqr * (len(data) ** (-1/3))
    data_range = np.max(data) - np.min(data)
    num_bins = int(np.ceil(data_range / bin_width))

    num_bins_limited = max(min_bins, min(num_bins, max_bins))

    return num_bins_limited
Function: testing_averages
def testing_averages(df1, df2, parameter, alpha=0.05, descriptive_stat=True, x_limits=None, histogram=True):
    """
    This function conducts statistical tests to compare two samples, determines the appropriate test based on data normality,
    calculates descriptive statistics and optionally creates a histogram for visual comparison.

    Parameters:
    - df1 (pandas.DataFrame): first DataFrame containing the data to be analyzed.
    - df2 (pandas.DataFrame): second DataFrame containing the data to be analyzed.
    - parameter (str): the column name in both DataFrames to be analyzed and compared.
    - alpha (float, optional): significance level for hypothesis testing. Default - 0.05.
    - descriptive_stat (bool, optional): whether to display descriptive statistics. Default - True.
    - x_limits (list of float, optional): the x-axis limits for the histogram. If None, limits are set automatically. Default - None.
    - histogram (bool, optional): whether to display a histogram. Default - True.

    Returns:
    None. Prints the results of the hypothesis test, descriptive statistics, and displays a histogram.
    ----------------
    Note: for large sample sizes (N > 5000) the function warns that visual inspection and skewness examination are recommended
    to verify the results of the Shapiro-Wilk test, as it may reject normality even for approximately normal data in large datasets.
    ----------------
    """
    sample1 = df1[parameter]
    sample2 = df2[parameter]

    # checking normality in both samples using Shapiro-Wilk test
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message="p-value may not be accurate for N > 5000.")
        stat1, p1_norm = stats.shapiro(sample1)
        stat2, p2_norm = stats.shapiro(sample2)

    if p1_norm > alpha and p2_norm > alpha:
        # if both samples are normal, perform a t-test and calculate mean as typical statistic, otherwise calculate median
        # also check the equality of variances using Levene's test
        typical_stat = np.mean
        typical_stat_name = 'mean'
        statslev, p_levene = stats.levene(sample1, sample2)

        if p_levene < alpha:
            # variances are not equal, use Welch's t-test (unequal variances)
            stat_t, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)
            test_choice = f'\033[1mWelch\'s t-test performed\033[0m (as both samples are normal but variances are not equal)'
        else:
            # variances are equal, use Student's t-test (equal variances)
            stat_t, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)
            test_choice = f'\033[1mt-test performed\033[0m (as both samples are normal and variances are equal)'
    else:
        # if one or both samples are not normal, perform a Mann-Whitney U test (non-parametric)
        typical_stat = np.median
        typical_stat_name = 'median'
        stat_t, p_value = stats.mannwhitneyu(sample1, sample2)
        test_choice = f'\033[1mMann-Whitney U test performed\033[0m (as one or both samples are not normal)'

    # printing test results
    print()
    display(Markdown(f'**Testing averages of \"{parameter}\" in \"{get_df_name(df1)}\" and \"{get_df_name(df2)}\"**'))
    print('='*100)
    if len(sample1) > 5000 or len(sample2) > 5000:
        print(
            f'\033[1;31mNote\033[0m:\033[1m Visual inspection of the distributions shape and examination of skewness is recommended to verify results of Shapiro-Wilk test of normality.\033[0m'
            f' (The Shapiro-Wilk and other normality tests may reject normality even for approximately normal data, on large sample sizes as currently.)'
            f'\n{"-"*100}')
    print(test_choice)
    print('-'*100)
    print(f'P-value: {p_value:.3f}')
    if p_value < alpha:
        print(f'\033[1;31mReject the null hypothesis (H0)\033[0m: there are significant differences between the groups.')
    else:
        print(f'\033[1;32mFail to reject the null hypothesis (H0)\033[0m: there is no significant evidence of differences between the groups.')
    print('-'*100)

    if descriptive_stat:
        # calculating and displaying descriptive statistics
        # if both distributions are normal we report means, otherwise medians, as they better represent typical values when distributions are significantly skewed
        print(f'\033[1mDescriptive statistics\033[0m:\n')
        print(f'{typical_stat_name} of \"{parameter}\" in \"{get_df_name(df1)}\": {round(typical_stat(sample1),1)}')
        print(f'{typical_stat_name} of \"{parameter}\" in \"{get_df_name(df2)}\": {round(typical_stat(sample2),1)}')
        relative_difference = (typical_stat(sample2) - typical_stat(sample1)) / typical_stat(sample1) * 100
        print(
            f'The relative difference in {typical_stat_name}s: '
            f'{relative_difference:.1f}% \n'
            f'({"increase" if relative_difference > 0 else "decrease"} from \"{parameter}\" in \"{get_df_name(df1)}\" '
            f'to \"{parameter}\" in \"{get_df_name(df2)}\")\n')
        print(f'Variance of \"{parameter}\" in \"{get_df_name(df1)}\": {round(np.var(sample1),1)}')
        print(f'Variance of \"{parameter}\" in \"{get_df_name(df2)}\": {round(np.var(sample2),1)}\n')
        print(f'Standard Deviation of \"{parameter}\" in \"{get_df_name(df1)}\": {round(np.sqrt(np.var(sample1)),1)}')
        print(f'Standard Deviation of \"{parameter}\" in \"{get_df_name(df2)}\": {round(np.sqrt(np.var(sample2)),1)}')
        print('-'*100)

    if histogram:
        # calculating bins for the larger sample
        larger_sample = sample1 if len(sample1) >= len(sample2) else sample2
        smaller_sample = sample2 if len(sample1) >= len(sample2) else sample1
        bins_larger = bins_calculation(larger_sample)

        # adjusting bins for the smaller sample proportionally to the sample sizes
        bins_smaller = max(10, int(bins_larger * (len(smaller_sample) / len(larger_sample))))

        # assigning bins to samples
        if len(sample1) >= len(sample2):
            bins1, bins2 = bins_larger, bins_smaller
        else:
            bins1, bins2 = bins_smaller, bins_larger

        # plotting collective histogram
        sns.histplot(sample1, kde=True, stat='density', color='green', alpha=0.5, bins=bins1, label=f'{parameter} in {get_df_name(df1)} (1)')
        sns.histplot(sample2, kde=True, stat='density', color='blue', alpha=0.5, bins=bins2, label=f'{parameter} in {get_df_name(df2)} (2)')
        plt.xlabel(parameter)
        plt.ylabel('Distribution Density')

        title = f'Collective Histogram of \"{parameter}\" in \"{get_df_name(df1)}\" and \"{get_df_name(df2)}\", bins (1) = {bins1}, bins (2) = {bins2}'
        wrapped_title = wrap_text(title, 70)  # adjusting title width when necessary
        plt.title(wrapped_title, y=1.03)

        # set manual xlim if it's provided
        if x_limits is not None:
            plt.xlim(x_limits)
        plt.legend()
        plt.show()
        print('='*100)
The hypotheses:
- H0 (null hypothesis): there is no significant difference in the average quantity sold between A-class products priced above and below the median price.
- H1 (alternative hypothesis): there is a significant difference in the average quantity sold between A-class products priced above and below the median price.
# getting a list of unique A-class units
a_class_units_list = df_ecom_summary_12m.query('abc_class == "A"')['stock_code_description'].unique().tolist()
len(a_class_units_list)
a_class_units_list[:3] # sample of A-class products
842
['22423__REGENCY CAKESTAND 3 TIER',
'85123A__WHITE HANGING HEART T-LIGHT HOLDER',
'47566__PARTY BUNTING']
# getting all entries with A-class units
a_class_units_entries_12m = df_ecom_filtered_12m.copy().query('stock_code_description in @a_class_units_list')
a_class_units_entries_12m.head(3)
invoice_no | stock_code | initial_description | quantity | invoice_date | unit_price | customer_id | invoice_year | invoice_month | invoice_year_month | invoice_week | invoice_year_week | invoice_day | invoice_day_of_week | invoice_day_name | revenue | description | stock_code_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5220 | 536847 | 22065 | CHRISTMAS PUDDING TRINKET POT | 24 | 2018-12-01 09:31:00 | 1.45 | 17135 | 2018 | 12 | 2018-12 | 48 | 2018-Week-48 | 2018-12-01 | 5 | Saturday | 34.80 | CHRISTMAS PUDDING TRINKET POT | 22065__CHRISTMAS PUDDING TRINKET POT |
5222 | 536847 | 84347 | ROTATING SILVER ANGELS T-LIGHT HLDR | 6 | 2018-12-01 09:31:00 | 2.55 | 17135 | 2018 | 12 | 2018-12 | 48 | 2018-Week-48 | 2018-12-01 | 5 | Saturday | 15.30 | ROTATING SILVER ANGELS T-LIGHT HLDR | 84347__ROTATING SILVER ANGELS T-LIGH... |
5223 | 536847 | 21231 | SWEETHEART CERAMIC TRINKET BOX | 24 | 2018-12-01 09:31:00 | 1.25 | 17135 | 2018 | 12 | 2018-12 | 48 | 2018-Week-48 | 2018-12-01 | 5 | Saturday | 30.00 | SWEETHEART CERAMIC TRINKET BOX | 21231__SWEETHEART CERAMIC TRINKET BOX |
# calculating the median price of A-class products and splitting the data into entries priced above and below it
a_class_median_price = a_class_units_entries_12m['unit_price'].median()
print(f'\033[1mMedian price of A-class products: {round(a_class_median_price, 1)}\033[0m')

a_class_price_above_median = a_class_units_entries_12m.copy().query('unit_price >= @a_class_median_price')
a_class_price_below_median = a_class_units_entries_12m.copy().query('unit_price < @a_class_median_price')
Median price of A-class products: 2.5
distribution_IQR(df=a_class_price_above_median, parameter='quantity', x_limits=[0,70], title_extension='',
                 bins=[1000, 4000], speed_up_plotting=True, outliers_info=False)
Note: A sample data slice 6% of "a_class_price_above_median" was used for histogram plotting instead of the full DataFrame.
This significantly reduced plotting time for the large dataset. The accuracy of the visualization might be slightly reduced, meanwhile it should be sufficient for exploratory analysis.
==================================================
Statistics on quantity
in a_class_price_above_median
count 155985.00
mean 5.87
std 21.20
min 1.00
25% 1.00
50% 2.00
75% 6.00
max 1930.00
Name: quantity, dtype: float64
--------------------------------------------------
The distribution is extremely skewed to the right
(skewness: 26.7)
Note: outliers affect skewness calculation
==================================================
distribution_IQR(df=a_class_price_below_median, parameter='quantity', x_limits=[0,70], title_extension='',
                 bins=[1000, 4000], speed_up_plotting=True, outliers_info=False)
Note: A sample data slice 6% of "a_class_price_below_median" was used for histogram plotting instead of the full DataFrame.
This significantly reduced plotting time for the large dataset. The accuracy of the visualization might be slightly reduced, meanwhile it should be sufficient for exploratory analysis.
==================================================
Statistics on quantity
in a_class_price_below_median
count 155585.00
mean 16.61
std 55.98
min 1.00
25% 2.00
50% 10.00
75% 12.00
max 4800.00
Name: quantity, dtype: float64
--------------------------------------------------
The distribution is extremely skewed to the right
(skewness: 26.6)
Note: outliers affect skewness calculation
==================================================
As a next step, we will evaluate the shares of A-class products priced above and below the median in the total quantity sold and total revenue generated by all A-class products.
share_evaluation(a_class_price_above_median, a_class_units_entries_12m, show_qty_rev=True, show_period=False)
======================================================================================================================================================
Evaluation of share: a_class_price_above_median
in a_class_units_entries_12m
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 155985 (50.1% of all entries)
Quantity: 916110 (26.2% of the total quantity)
Revenue: 4365494.0 (57.4% of the total revenue)
======================================================================================================================================================
share_evaluation(a_class_price_below_median, a_class_units_entries_12m, show_qty_rev=True, show_period=False)
======================================================================================================================================================
Evaluation of share: a_class_price_below_median
in a_class_units_entries_12m
------------------------------------------------------------------------------------------------------------------------------------------------------
Number of entries: 155585 (49.9% of all entries)
Quantity: 2584470 (73.8% of the total quantity)
Revenue: 3246461.6 (42.6% of the total revenue)
======================================================================================================================================================
Observations
The distribution_IQR function’s histograms, boxplots, and descriptive statistics clearly show that price has a significant impact on the quantity sold. The median quantity values for a_class_price_above_median and a_class_price_below_median differ by a factor of five: 2 and 10, respectively.
The data is not normally distributed. Both distributions of quantity sold (for products above and below the median price) are heavily skewed to the right, indicating a strong difference in sales patterns.
The summary from the share_evaluation
function shows that products above the median price account for about 26% of the total quantity sold and 57% of the total revenue within this class. In contrast, products below the median price have a higher sales volume, making up about 74% of the total quantity while generating only 43% of the total revenue for this group.
Based on these figures and observations, we can confidently state that the alternative hypothesis is supported: there is a significant difference in the average quantity sold between products priced above and below the median price for A-class products.
The practical significance of these findings is as follows:
Considering the non-normal distributions, we could run a Mann-Whitney U test to compare the groups. However, given our observations, it seems unnecessary. The difference between the samples is already clear and significant, and the practical importance is evident.
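If we nevertheless wanted to formalize this comparison, a minimal sketch of such a Mann-Whitney U test on the two price groups could look as follows (using the a_class_price_above_median and a_class_price_below_median DataFrames built above; the import is shown only for completeness):

# a minimal sketch: Mann-Whitney U test on 'quantity' for the two A-class price groups (optional check)
from scipy import stats

stat, p_value = stats.mannwhitneyu(
    a_class_price_above_median['quantity'],
    a_class_price_below_median['quantity'])

print(f'Mann-Whitney U statistic: {stat:.0f}, p-value: {p_value:.3f}')
if p_value < 0.05:
    print('Reject H0: the quantity distributions differ significantly.')
else:
    print('Fail to reject H0: no significant evidence of a difference.')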
The hypotheses:
- H0 (null hypothesis): there is no significant difference in the average daily sales (average quantity sold per product) between newly introduced products and established products.
- H1 (alternative hypothesis): there is a significant difference in the average daily sales (average quantity sold per product) between newly introduced products and established products.
# filtering entries of old products only
df_ecom_filtered_12m_old_products = df_ecom_filtered_12m.copy().query('stock_code_description not in @new_products_list_12m')
#share_evaluation(df_ecom_filtered_12m_new_products_only, df_ecom_filtered_3m, show_qty_rev=True, show_period=True, show_example=True, example_type='head')
# getting daily summary for all products in the last 3 months
daily_products_3m = df_ecom_filtered_12m.query('invoice_year_month in @last_3_months').groupby('invoice_day').agg({
    'quantity': 'sum',
    'revenue': 'sum',
    'stock_code_description': 'nunique'
}).reset_index()
daily_products_3m = daily_products_3m.rename(columns={'stock_code_description': 'unique_products'})

# getting daily summary for new products in the last 3 months
daily_new_products = df_ecom_filtered_12m_new_products_only.groupby('invoice_day').agg({
    'quantity': 'sum',
    'revenue': 'sum',
    'stock_code_description': 'nunique'
}).reset_index()
daily_new_products = daily_new_products.rename(columns={'stock_code_description': 'unique_products'})

# getting daily summary for old products in the last 3 months
daily_old_products_3m = df_ecom_filtered_12m_old_products.query('invoice_year_month in @last_3_months').groupby('invoice_day').agg({
    'quantity': 'sum',
    'revenue': 'sum',
    'stock_code_description': 'nunique'
}).reset_index()
daily_old_products_3m = daily_old_products_3m.rename(columns={'stock_code_description': 'unique_products'})

print(f'\033[1mTop 3 rows of the daily summaries in the last 3 months:\033[0m\n')
print('All products:')
daily_products_3m.head(3)
print('New products:')
daily_new_products.head(3)
print('Old products:')
daily_old_products_3m.head(3)
Top 3 rows of the daily summaries in the last 3 months:
All products:
invoice_day | quantity | revenue | unique_products | |
---|---|---|---|---|
0 | 2019-09-02 | 10911 | 16878.74 | 820 |
1 | 2019-09-03 | 22722 | 36276.35 | 881 |
2 | 2019-09-04 | 15058 | 27998.06 | 704 |
New products:
invoice_day | quantity | revenue | unique_products | |
---|---|---|---|---|
0 | 2019-09-02 | 92 | 288.07 | 14 |
1 | 2019-09-03 | 96 | 171.54 | 14 |
2 | 2019-09-04 | 162 | 561.43 | 21 |
Old products:
invoice_day | quantity | revenue | unique_products | |
---|---|---|---|---|
0 | 2019-09-02 | 10819 | 16590.67 | 806 |
1 | 2019-09-03 | 22626 | 36104.81 | 867 |
2 | 2019-09-04 | 14896 | 27436.63 | 683 |
# checking number of days covered
len(daily_new_products)
len(daily_old_products_3m)
78
78
#share_evaluation(daily_new_products, daily_products_3m, show_qty_rev=True, show_example=False, example_type='head')
# creating necessary columns, handling possible issues with dividing by zeros
daily_new_products['avg_qty_per_product'] = daily_new_products['quantity'].div(daily_new_products['unique_products'], fill_value=0)
daily_new_products['avg_rev_per_product'] = daily_new_products['revenue'].div(daily_new_products['unique_products'], fill_value=0)

daily_old_products_3m['avg_qty_per_product'] = daily_old_products_3m['quantity'].div(daily_old_products_3m['unique_products'], fill_value=0)
daily_old_products_3m['avg_rev_per_product'] = daily_old_products_3m['revenue'].div(daily_old_products_3m['unique_products'], fill_value=0)

print(f'\033[1mTop 3 rows of the daily summaries in the last 3 months:\033[0m\n')
print('New products:')
daily_new_products.head(3)
print('Old products:')
daily_old_products_3m.head(3)
Top 3 rows of the daily summaries in the last 3 months:
New products:
invoice_day | quantity | revenue | unique_products | avg_qty_per_product | avg_rev_per_product | |
---|---|---|---|---|---|---|
0 | 2019-09-02 | 92 | 288.07 | 14 | 6.57 | 20.58 |
1 | 2019-09-03 | 96 | 171.54 | 14 | 6.86 | 12.25 |
2 | 2019-09-04 | 162 | 561.43 | 21 | 7.71 | 26.73 |
Old products:
invoice_day | quantity | revenue | unique_products | avg_qty_per_product | avg_rev_per_product | |
---|---|---|---|---|---|---|
0 | 2019-09-02 | 10819 | 16590.67 | 806 | 13.42 | 20.58 |
1 | 2019-09-03 | 22626 | 36104.81 | 867 | 26.10 | 41.64 |
2 | 2019-09-04 | 14896 | 27436.63 | 683 | 21.81 | 40.17 |
distribution_IQR(daily_new_products, 'avg_qty_per_product', title_extension='', bins=[10,40], speed_up_plotting=False, outliers_info=False)
==================================================
Statistics on avg_qty_per_product
in daily_new_products
count 78.00
mean 18.29
std 12.02
min 4.94
25% 11.31
50% 15.18
75% 19.32
max 64.84
Name: avg_qty_per_product, dtype: float64
--------------------------------------------------
The distribution is highly skewed to the right
(skewness: 2.1)
Note: outliers affect skewness calculation
==================================================
distribution_IQR(daily_old_products_3m, 'avg_qty_per_product', title_extension='', bins=[10,40], speed_up_plotting=False, outliers_info=False)
==================================================
Statistics on avg_qty_per_product
in daily_old_products_3m
count 78.00
mean 21.95
std 7.29
min 7.20
25% 17.59
50% 20.75
75% 25.73
max 49.29
Name: avg_qty_per_product, dtype: float64
--------------------------------------------------
The distribution is moderately skewed to the right
(skewness: 0.9)
Note: outliers affect skewness calculation
==================================================
testing_averages(daily_new_products, daily_old_products_3m, 'avg_qty_per_product', alpha=0.05, descriptive_stat=True, histogram=True)
Testing averages of “avg_qty_per_product” in “daily_new_products” and “daily_old_products_3m”
====================================================================================================
Mann-Whitney U test performed (as one or both samples are not normal)
----------------------------------------------------------------------------------------------------
P-value: 0.000
Reject the null hypothesis (H0): there are significant differences between the groups.
----------------------------------------------------------------------------------------------------
Descriptive statistics:
median of "avg_qty_per_product" in "daily_new_products": 15.2
median of "avg_qty_per_product" in "daily_old_products_3m": 20.8
The relative difference in medians: 36.8%
(increase from "avg_qty_per_product" in "daily_new_products" to "avg_qty_per_product" in "daily_old_products_3m")
Variance of "avg_qty_per_product" in "daily_new_products": 142.5
Variance of "avg_qty_per_product" in "daily_old_products_3m": 52.4
Standard Deviation of "avg_qty_per_product" in "daily_new_products": 11.9
Standard Deviation of "avg_qty_per_product" in "daily_old_products_3m": 7.2
----------------------------------------------------------------------------------------------------
====================================================================================================
Observations
The distribution_IQR function’s histograms, boxplots, and descriptive statistics clearly show that established products exhibit more predictable sales volumes than newly introduced items, which display greater variability. Specifically, for the avg_qty_per_product metric, established products have a higher median (20.8 vs. 15.2) and a lower standard deviation (about 7.3 vs. 12.0) than new products.
The Mann-Whitney U test indicates a statistically significant difference between the average quantity per product for new and established products.
Based on these findings, we can confidently conclude that the Alternative Hypothesis is supported: there is a significant difference in the average daily sales between newly introduced products and established products.
💡 In practice, this means that products generally experience increased sales over time, with established products showing more consistent and higher average quantities sold per product. This highlights the importance of allowing products enough time to mature in the market before: 1) making critical decisions (e.g., withdrawal from the assortment), and 2) assessing them like other products. This supports our previous decision to flag new products in the context of ABC-XYZ analysis.
During data preprocessing, we found and addressed the following:
- Invoice numbers (invoice_no) and customer IDs (customer_id) contain non-integer values.
- invoice_date was converted from an object to datetime for better time-based analysis.
- customer_id contains 25% missing values, while description has 0.3% missing values.
- Entries with a missing customer_id were retained, converting these values to zeros for proper data processing.
- Negative quantity values (2% of entries) were retained for further analysis, as they could indicate product returns.
- Entries with invalid unit_price values were removed (only two cases).
- We created a revenue column for revenue analysis.

At this stage, we focused on quantity, unit price, and revenue, aiming to understand data distributions, spot outliers, and analyze atypical entries. The goal was to extract insights that would be valuable for the next steps in our study.
# examination of quantity totals and distributions of 10 top-selling products
plot_totals_distribution(df_ecom_filtered, 'stock_code_description', 'quantity', show_outliers=False, fig_height=500, n_items=10)
In this stage, we examined sales trends over time, focusing on seasonality, anomalies, and long-term trends.
# creating line plots - for each parameter's absolute change
# defining the colors
colors = {
    'revenue': 'darkred',
    'quantity': 'teal',
    'unique_invoices': 'navy',
    'unique_products': 'purple',
    'unique_customers': 'darkgreen',
    'unit_price_mean': 'darkgoldenrod',
    'unit_price_median': 'darkorange',
    'revenue_mean': 'crimson',
    'revenue_median': 'darkred',
    'quantity_mean': 'darkseagreen',
    'quantity_median': 'teal'}

fig = go.Figure()

# adding traces
for parameter in parameters:
    color = colors.get(parameter, 'gray')  # Default to gray if parameter not in colors dict

    fig.add_trace(go.Scatter(
        x=monthly_summary['invoice_year_month'],
        y=monthly_summary[f'{parameter}_absolute_change_pct'],
        mode='lines+markers',
        name=f'{parameter}',
        marker=dict(size=8, color=color),
        line=dict(width=2, color=color),
        hovertemplate='<b>%{x}</b><br>' +
                      f'Parameter: {parameter} Absolute Change<br>' +
                      'Value: %{y:.2f}%<extra></extra>'))  # hiding secondary box in hover labels

for m_parameter in m_parameters:
    color = colors.get(m_parameter, 'gray')  # Default to gray if parameter not in colors dict

    fig.add_trace(go.Scatter(
        x=monthly_invoices_summary['invoice_year_month'],
        y=monthly_invoices_summary[f'{m_parameter}_absolute_change_pct'],
        mode='lines+markers',
        name=f'invoice_{m_parameter}',
        marker=dict(size=8, symbol='diamond', color=color),
        line=dict(width=2, dash='dot', color=color),
        hovertemplate='<b>%{x}</b><br>' +
                      f'Parameter: invoice_{m_parameter} Absolute Change<br>' +
                      'Value: %{y:.2f}%<extra></extra>'))  # hiding secondary box in hover labels

# adding annotations for the milestones
milestone_number = 0
for milestone in ['2019-02', '2019-08']:
    milestone_number += 1
    milestone_title = f'Milestone {milestone_number}'
    milestone_date = datetime.strptime(milestone, '%Y-%m') - timedelta(days=5)

    fig.add_annotation(
        text=milestone_title,
        yref='y',
        x=milestone_date, y=140, textangle=-90,
        showarrow=False,
        font=dict(size=14, color='gray'))

fig.update_layout(
    title={'text': 'Absolute Changes in Parameters by Month', 'font_size': 20, 'y': 0.92, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    xaxis_title='Month',
    yaxis_title='Absolute Change (%)',
    xaxis_tickangle=-45,
    yaxis=dict(showgrid=True),
    showlegend=True,
    # legend={'y': 0.97, 'x': 0.03},
    width=1400,
    height=900)

fig.add_hline(y=0, line_color='darkgray', line_width=2, line_dash='solid')
for milestone in ['2019-02', '2019-08']:
    fig.add_vline(x=milestone, line_color='darkgray', line_width=2, line_dash='dash')
fig.show()
# plotting totals and relevant distributions for revenue by day of week
plot_totals_distribution(daily_summary_12m, 'invoice_day_name', 'revenue', show_outliers=False, title_start=False,
                         plot_totals=True, plot_distribution=True, fig_height=500, consistent_colors=True)
# plotting a line plot of the distribution of invoices by week
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=weekly_invoices['invoice_year_week'],
    y=weekly_invoices['unique_invoices'],
    mode='lines+markers',
    line_color='navy',
    name='Weekly Invoices'))

fig.update_layout(
    title={'text': 'Invoices by Week', 'font_size': 20, 'y': 0.9, 'x': 0.5},
    xaxis_title='Week',
    yaxis_title='Invoices',
    width=1100,
    height=600,
    xaxis=dict(tickangle=-45))

# adding markers highlighting peaks of orders
peak_weeks = ['2018-Week-49', '2019-Week-46']
peak_data = weekly_invoices[weekly_invoices['invoice_year_week'].isin(peak_weeks)]

fig.add_trace(go.Scatter(
    x=peak_data['invoice_year_week'],
    y=peak_data['unique_invoices'],
    mode='markers',
    marker=dict(color='green', size=100, symbol='circle-open',
                line=dict(color='green', width=1)),
    name='Peak Weeks'))

for week in peak_weeks:
    fig.add_vline(x=week, line_color='green', line_width=1, line_dash='dash')

fig.show()
The correlation analysis confirmed our findings from the previous Time-based Analysis stage, for instance the strong correlation between the number of unique customers and the number of unique products sold.
We quantified these relationships, showing a stronger dependency of median invoice quantity on time (year-month) than of median invoice revenue.
In particular, we proved that:
We classified products by sales revenue (ABC classification) and demand variability (XYZ classification) to improve inventory management and guide business development (e.g., focusing promotions on high-value products and considering removal of underperformers).
We excluded returns entries, analyzing them separately.
We included new products, as they significantly contributed to sales, flagging them for separate analysis.
The ABC-XYZ classification findings are as follows.
To summarize the performance of the ABC-XYZ classes, we will create two comprehensive visualizations:
- Pareto diagrams for Revenue and Quantity contributions by ABC-XYZ class.
- A combined graph displaying key metrics, including Revenue, Quantity, Stock Code Percentages, and CoV Quantity by ABC-XYZ class.
# creating separate DataFrames for quantity and revenue Pareto analyses
df_quantity = df_abc_xyz_summary.sort_values('quantity', ascending=False).copy()
df_quantity['cumulative_units_pct'] = df_quantity['quantity'].cumsum() / df_quantity['quantity'].sum()

df_revenue = df_abc_xyz_summary.sort_values('revenue', ascending=False).copy()
df_revenue['cumulative_revenue_pct'] = df_revenue['revenue'].cumsum() / df_revenue['revenue'].sum()

# creating a subplot with two columns
fig = make_subplots(rows=1, cols=2, specs=[[{'secondary_y': True}, {'secondary_y': True}]],
                    subplot_titles=('Revenue Contribution', 'Quantity Contribution'),
                    horizontal_spacing=0.15)

# right plot for quantity
fig.add_trace(
    go.Bar(
        x=df_quantity['abc_xyz_class'],
        y=df_quantity['quantity'],
        name='Total Units',
        text=round(df_quantity['quantity']),
        textposition='outside',
        marker_color=df_quantity['quantity'],
        marker_colorscale='RdYlGn'),
    row=1, col=2)

fig.add_trace(
    go.Scatter(
        x=df_quantity['abc_xyz_class'],
        y=df_quantity['cumulative_units_pct'],
        mode='lines+markers',
        name='Cumulative % (Units)',
        line=dict(color='red', width=2),
        marker=dict(size=8)),
    row=1, col=2,
    secondary_y=True)

# left plot for revenue
fig.add_trace(
    go.Bar(
        x=df_revenue['abc_xyz_class'],
        y=df_revenue['revenue'],
        name='Total Revenue',
        text=round(df_revenue['revenue']),
        textposition='outside',
        marker_color=df_revenue['revenue'],
        marker_colorscale='RdYlGn'),
    row=1, col=1)

fig.add_trace(
    go.Scatter(
        x=df_revenue['abc_xyz_class'],
        y=df_revenue['cumulative_revenue_pct'],
        mode='lines+markers',
        name='Cumulative % (Revenue)',
        line=dict(color='red', width=2),
        marker=dict(size=8)),
    row=1, col=1,
    secondary_y=True)

fig.update_layout(
    title={
        'text': 'Pareto Charts for Quantity and Revenue Contribution by ABC-XYZ Class',
        'y': 0.95,
        'x': 0.5},
    height=600,
    width=1400,
    showlegend=False)

fig.update_xaxes(title_text="ABC-XYZ Class", row=1, col=1)
fig.update_xaxes(title_text="ABC-XYZ Class", row=1, col=2)
fig.update_yaxes(title_text="Total Revenue", secondary_y=False, row=1, col=1)
fig.update_yaxes(title_text="Cumulative %", secondary_y=True, tickformat='.0%', row=1, col=1)
fig.update_yaxes(title_text="Quantity", secondary_y=False, row=1, col=2)
fig.update_yaxes(title_text="Cumulative %", secondary_y=True, tickformat='.0%', row=1, col=2)
fig.show()
# adding new columns for percentages of totals
df_abc_xyz_summary['revenue_pct'] = df_abc_xyz_summary['revenue'] / df_abc_xyz_summary['revenue'].sum()
df_abc_xyz_summary['quantity_pct'] = df_abc_xyz_summary['quantity'] / df_abc_xyz_summary['quantity'].sum()
df_abc_xyz_summary['stock_codes_pct'] = df_abc_xyz_summary['unique_products'] / df_abc_xyz_summary['unique_products'].sum()
df_abc_xyz_summary = df_abc_xyz_summary.sort_values(by='abc_xyz_class')
#df_abc_xyz_summary
# creating a combined graph for ABC-XYZ Classes
fig = make_subplots(specs=[[{'secondary_y': True}]])

# adding data / traces to plots
for name, color in [('revenue_pct', 'darkred'),
                    ('quantity_pct', 'teal'),
                    ('stock_codes_pct', 'grey')]:
    fig.add_trace(
        go.Bar(x=df_abc_xyz_summary['abc_xyz_class'],
               y=df_abc_xyz_summary[name], name=name,
               marker_color=color), secondary_y=False)

# adding CoV quantity line
fig.add_trace(
    go.Scatter(x=df_abc_xyz_summary['abc_xyz_class'],
               y=df_abc_xyz_summary['cov_quantity'],
               name='CoV Quantity',
               mode='lines+markers',
               line={'color': 'purple', 'width': 3},
               marker={'size': 8}),
    secondary_y=True)

fig.update_layout(
    title={'text': 'Revenue, Quantity, Stock Codes Percentage and CoV Quantity by ABC-XYZ Class',
           'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    barmode='group',
    xaxis_title='ABC-XYZ Class',
    legend={'orientation': 'h', 'yanchor': "bottom", 'y': 1.02,
            'xanchor': "right", 'x': 1},
    height=550, width=1000)

max_pct = max(df_abc_xyz_summary[['revenue_pct', 'quantity_pct', 'stock_codes_pct']].max()) * 1.1  # extending the graph's height
fig.update_yaxes(title_text='Percentage', tickformat='.1%', range=[0, max_pct], secondary_y=False)
fig.update_yaxes(title_text='CoV Quantity', tickfont={'color': 'purple'},
                 titlefont={'color': 'purple'}, tickcolor='purple', secondary_y=True)
fig.update_xaxes(type='category', categoryorder='array',
                 categoryarray=df_abc_xyz_summary['abc_xyz_class'].tolist())

fig.show()
We defined inventory management and business development strategies tailored for ABC-XYZ classes (see the full Class - recommendations matrix below):
 | A (Premium) | B (Standard) | C (Basic) |
---|---|---|---|
X (Stable) | AX Class | BX Class | CX Class |
🟥 Automation | - Automate replenishment | - Automate replenishment | - Automate replenishment |
🟦 Buffers | - Use low buffer inventory with supplier-held stock for supply security | - Maintain low buffer inventory with a safety-first approach | - Maintain low buffer inventory with a safety-first approach |
🟩 Inventory | - Implement real-time inventory tracking | - Conduct periodic counts for medium security | - Use free stock or periodic estimation via inspection/weighing |
🟡 M&S | - Adjust pricing often - Use best-possible media content, detailed product info and customers’ feedback - Actively invest in marketing campaigns | - Tune prices regularly - Ensure good enough media content and clear descriptions - Run occasional marketing campaigns | - Minimal pricing adjustments - Basic descriptions - Low marketing efforts, consider as complementary purchases |
🟣 PD | - Focus on unique features and continuous improvement | - Update based on customer demands | - Keep it simple, only essentials |
Y (Seasonal) | AY Class | BY Class | CY Class |
🟥 Automation | - Automate replenishment while allowing manual adjustments | - Automate replenishment while allowing manual adjustments | - Automate replenishment |
🟦 Buffers | - Accept stockout risks with low buffer inventory | - Adjust buffers manually for seasonality | - Maintain high buffer inventory for safety-first measures |
🟩 Inventory | - Implement real-time inventory tracking | - Conduct periodic counts for medium security | - Use free stock or periodic estimation via inspection/weighing |
🟡 M&S | - Adjust pricing based on seasonal demand - Launch exclusive seasonal promotions | - Run limited-time promotions for niche markets - Market based on trends and demand shifts | - Focus on wholesales and large seasonal sales |
🟣 PD | - Offer seasonal variations | - Tune to match seasonal trends | - Check whether they are sold solely or in bigger purchases - Consider using them as complementary goods or withdrawing them |
Z (Irregular) | AZ Class | BZ Class | CZ Class |
🟥 Automation | - Operate on a buy-to-order basis | - Operate on a buy-to-order basis | - Automate replenishment |
🟦 Buffers | - Avoid buffers, ensure customers understand lead times | - Avoid buffers, ensure customers understand lead times | - Maintain high buffer inventory for safety-first measures |
🟩 Inventory | - Do not stock these products | - Do not stock these products | - Use free stock or periodic estimation via inspection/weighing |
🟡 M&S | - Adjust prices on occasions - Focus on sales for high-value customers | - Keep pricing flexible and consultative - Target niche customers | - Depends on overall performance trends* |
🟣 PD | - Provide custom solutions based on customer needs | - Provide only low-effort custom solutions | - Depends on overall performance trends* |
Note: ABC analysis works best when the Pareto principle (80/20 rule) holds, which is the case in our study. However, when long-tail effects dominate (where revenue is spread across many lower-performing items instead of a few top-sellers), ABC-XYZ recommendations must be adjusted.
In a strict Pareto scenario, low-performing products (C-Class), especially with irregular demand (Y and Z classes), are typically candidates for replacement or withdrawal. If long-tail effects are more prominent, the focus should shift to efficient inventory management and maintaining a diverse product range, even for lower performers. Our time-based analysis suggests an increasing long-tail effect, while the Pareto rule still generally holds.
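As an illustration of how such a check can be run on our summary data, here is a minimal sketch measuring the revenue share generated by the top 20% of products (it uses the per-product df_abc_xyz_new_products DataFrame built earlier; the 20% cutoff is simply the 80/20 heuristic, not a project-specific threshold):

# a minimal sketch: checking how closely the revenue split follows the 80/20 rule
revenue_sorted = df_abc_xyz_new_products.sort_values('revenue', ascending=False)['revenue']
top_20_pct_count = int(len(revenue_sorted) * 0.2)
top_20_pct_share = revenue_sorted.head(top_20_pct_count).sum() / revenue_sorted.sum()

print(f'Top 20% of products generate {top_20_pct_share:.1%} of total revenue')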
Returns analysis focused on mutually exclusive entries with negative quantities, though actual return volume may be higher due to returns processed outside defined rules.
We introduced two metrics, the “returns rate” and the “returns loss rate”: the returns rate is the percentage of all entries that represent returns, while the returns loss rate is the share of returns in total sales.
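A minimal sketch of how these two metrics can be computed is shown below (returns_entries is a hypothetical DataFrame holding the return entries; the DataFrame names and the revenue-based reading of the loss rate are assumptions, not the project’s exact implementation):

# a minimal sketch of the two return metrics (hypothetical DataFrame names)
returns_rate = len(returns_entries) / len(df_ecom_filtered_12m)  # share of entries that are returns
returns_loss_rate = abs(returns_entries['revenue'].sum()) / df_ecom_filtered_12m['revenue'].sum()  # share of sales revenue lost to returns

print(f'Returns rate: {returns_rate:.1%}')
print(f'Returns loss rate: {returns_loss_rate:.1%}')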
# plotting the bubble chart for ABC-XYZ & returns analysis
fig = px.scatter(
    df_abc_xyz_returns_summary,
    x='revenue',
    y='quantity',
    size='revenue',
    color='revenue',
    color_continuous_scale='RdYlGn',
    hover_name='abc_xyz_return_class',
    text='abc_xyz_return_class',
    title='ABC-XYZ & Returns Analysis: Bubble Chart of Quantity vs. Revenue')

fig.update_layout(
    height=650,
    width=650,
    title_x=0.5,
    title_y=0.9)
fig.update_traces(textposition='middle left')
fig.show()
EDA insights suggest pricing, unique products, and customer base influence revenue more than returns.
Combining ABC-XYZ with returns analysis can improve decision-making:
New products are defined as those that had sales in the last three months but never before.
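A minimal sketch of this definition in pandas could look as follows (using the df_ecom_filtered_12m DataFrame and last_3_months list from the analysis above; this is an illustration rather than the exact code used earlier):

# a minimal sketch: products sold in the last three months but never before
recent_entries = df_ecom_filtered_12m.query('invoice_year_month in @last_3_months')
earlier_entries = df_ecom_filtered_12m.query('invoice_year_month not in @last_3_months')

new_products = set(recent_entries['stock_code_description']) - set(earlier_entries['stock_code_description'])
print(f'New products identified: {len(new_products)}')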
💡💡 The business has evolved into a volume-based growth strategy rather than a price-driven one, focusing on expanding the product range, attracting new customers, and maintaining stable or slightly decreasing prices.
As a result, the business achieved ~153% growth in sales volume and ~118-121% growth in revenue, invoices, and customer base.
💡💡 We identified two distinct growth drivers:
💡💡 Products succeed in different ways:
💡💡 The data quality presented significant challenges:
Data preparation was crucial. Simply removing negative quantities or ignoring naming inconsistencies could have led to misclassifications. For instance, many identical actively sold products had non-identical descriptions, and many cases involved paired purchase-return entries, affecting product categorization if not addressed.
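For illustration only, a simplified sketch of flagging candidate purchase-return pairs might match entries on stock code, customer, unit price, and absolute quantity with opposite quantity signs (df_raw is a hypothetical raw-entries DataFrame; this is not the matching logic actually applied in the Distribution Analysis section):

# a simplified illustration of flagging candidate purchase-return pairs (hypothetical df_raw)
df_pairs = df_raw.copy()
df_pairs['abs_quantity'] = df_pairs['quantity'].abs()

pair_keys = ['stock_code', 'customer_id', 'unit_price', 'abs_quantity']
df_pairs['candidate_pair'] = (
    df_pairs.groupby(pair_keys)['quantity']
            .transform(lambda s: (s > 0).any() and (s < 0).any()))

print(f"Entries involved in candidate purchase-return pairs: {df_pairs['candidate_pair'].sum()}")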
Revenue does not equal Profit. Since product-level profit data is unavailable, the true impact of growth remains uncertain. Revenue increases could be driven by high promotional costs and/or substantial discounts, affecting profitability. A complete analysis would require access to margin and cost data.
Executive summary: Our analysis identifies key opportunities to enhance profitability through improved inventory management, targeted product development, optimized pricing and marketing activities. These recommendations are based on established analytical frameworks that enable easy analysis replication on fresh data to track progress.
We’ve developed a comprehensive Inventory Management & Product Development Action Matrix that outlines specific policies for each product category. The examples from the matrix include:
Note: If requested, we can enhance our ABC-XYZ analysis by adding extra criteria such as quantity sold and invoice frequency, creating classifications like AAAZ (high revenue, large quantities, frequent invoices, unstable demand). This modification would allow more precise marketing and inventory management policies.
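As a sketch of what such an extension could look like, the snippet below adds a single extra letter for invoice frequency to the existing ABC and XYZ letters (the top-20% cutoff and the two-letter frequency scale are illustrative assumptions, not project decisions):

# an illustrative sketch of extending the ABC-XYZ label with an invoice-frequency letter
invoice_counts = df_ecom_filtered_12m.groupby('stock_code_description')['invoice_no'].nunique()
frequency_threshold = invoice_counts.quantile(0.8)  # top 20% by invoice count (assumed cutoff)

df_extended = df_abc_xyz_new_products.copy()
df_extended['frequency_class'] = df_extended['stock_code_description'].map(
    lambda code: 'A' if invoice_counts.get(code, 0) >= frequency_threshold else 'C')
df_extended['extended_class'] = (df_extended['abc_class']
                                 + df_extended['frequency_class']
                                 + df_extended['xyz_class'])
df_extended[['stock_code_description', 'extended_class']].head(3)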
“Seaborn and Matplotlib Visualization Guide” Python Graph Gallery: https://python-graph-gallery.com/
This visualization resource helped me choose the most suitable data visualizations and color palettes to effectively communicate findings.
“Applied Time Series Analysis with Python: Forecasting, Modeling, and Seasonality Detection” Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/
This resource helped me implement time series analysis for identifying sales patterns, particularly seasonal trends, and provided text annotation techniques that enhanced visualizations.
“Text Mining and Natural Language Processing with NLTK” NLTK Documentation: https://www.nltk.org/book/
This resource was valuable for text analysis of product descriptions when studying and addressing naming issues. I particularly utilized Regular Expressions for detecting word patterns and text methods like lower() and split().
“Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales” MIT Sloan School of Management: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=953587
This research paper helped me understand how the traditional Pareto principle might evolve in e-commerce, informing our portfolio expansion recommendations and balancing of growth strategies.
“A Conceptual Model Based on Pareto Principle and Long Tail for Online and Offline Markets” Business Studies Journal: https://www.abacademies.org/articles/a-conceptual-model-based-on-pareto-principle-and-long-tail-for-online-and-offline-markets-14477.html
Similarly to the previous source, this article provided insights on how to balance between focusing on high-performing products and expanding product range, directly supporting our “Balance Growth Strategies” section of recommendations.
“ABC Inventory: Get the Most Out of Your Best-Selling Products” Katana MRP Resource Center: https://katanamrp.com/abc-inventory/
This resource provided practical insights on optimizing inventory for best-selling products, supporting our recommendations for high-value A-class items and implementing safety stock strategies.
“DataWiz - Inventory Classification Methods” (in Russian) Habr Technical Blog: https://habr.com/ru/companies/datawiz/articles/269167/
This technical blog post offered alternative perspectives on inventory classification methods that helped refine our approach to the ABC-XYZ analysis, particularly for products with irregular demand patterns.
“How to Create an ABC XYZ Inventory Classification Model” Practical Data Science Portal: https://web.archive.org/web/20240518062749/https://practicaldatascience.co.uk/data-science/how-to-create-an-abc-xyz-inventory-classification-model
This technical guide offered step-by-step instructions for implementing the ABC-XYZ model using data science techniques, which informed our methodology and ensured replicability of our analysis framework. We captured the main ideas for a practical implementation of ABC-XYZ analysis in Python, while enhancing the study methodology and developing our own way of visualizing the insights.
“ABC-XYZ Inventory Management” Association of International Certified Professional Accountants: https://web.archive.org/web/20230208135403/https://www.cgma.org/resources/tools/cost-transformation-model/abc-xyz-inventory-management.html
This professional resource provided a comprehensive perspective on inventory classification. We adopted and enriched their ABC-XYZ action matrix (containing Inventory Management policies) to develop our Inventory Management & Product Development Action Matrix, where we also added Marketing & Sales and Product Development policies for each class.