Create a summary of each customer over a calibration and holdout period.
This function creates a summary of each customer over a calibration and
holdout period (training and testing, respectively).
It accepts transaction data, and returns a DataFrame of sufficient statistics.
Parameters:
transactions – a Pandas DataFrame that contains the customer_id col and the datetime col.
customer_id_col (string) – the column in transactions DataFrame that denotes the customer_id
datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.
calibration_period_end – a period to limit the calibration to, inclusive.
observation_period_end – a string or datetime to denote the final date of the study.
Events after this date are truncated. If not given, defaults to the max ‘datetime_col’.
freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency & T. Example:
With freq='D' and freq_multiplier=1, we get recency=591 and T=632
With freq='h' and freq_multiplier=24, we get recency=590.125 and T=631.375
datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t understand
the provided format.
monetary_value_col (string, optional) – the column in transactions that denotes the monetary value of the transaction.
Optional, only needed for customer lifetime value estimation models.
include_first_transaction (bool, optional) – Default: False
By default the first transaction is not included while calculating frequency and
monetary_value. Can be set to True to include it.
Should be False if you are going to use this data with any of the fitters in the BTYD package.
Returns:
A dataframe with columns frequency_cal, recency_cal, T_cal, frequency_holdout, duration_holdout
If monetary_value_col isn’t None, the dataframe will also have the columns monetary_value_cal and
monetary_value_holdout.
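As a rough illustration of the split described above, the sketch below builds the calibration and holdout columns by hand with the standard library only. It is not the package's implementation: the helper name calibration_holdout_summary, the toy transactions, the day-level resolution, and the choice to drop customers with no calibration purchase are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (customer_id, purchase date).
transactions = [
    ("a", date(2024, 1, 1)), ("a", date(2024, 1, 10)), ("a", date(2024, 2, 20)),
    ("b", date(2024, 1, 5)), ("b", date(2024, 3, 1)),
    ("c", date(2024, 2, 25)),
]
calibration_period_end = date(2024, 1, 31)
observation_period_end = date(2024, 3, 1)


def calibration_holdout_summary(rows, cal_end, obs_end):
    """Summarise each customer over a calibration and a holdout period (in days)."""
    cal_days = defaultdict(set)      # distinct purchase days up to cal_end (inclusive)
    holdout_days = defaultdict(set)  # distinct purchase days after cal_end, up to obs_end
    for customer_id, day in rows:
        if day <= cal_end:
            cal_days[customer_id].add(day)
        elif day <= obs_end:
            holdout_days[customer_id].add(day)

    summary = {}
    # Assumption: only customers seen during calibration are kept.
    for customer_id, days in cal_days.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency_cal": len(days) - 1,  # first transaction not counted
            "recency_cal": (last - first).days,
            "T_cal": (cal_end - first).days,
            "frequency_holdout": len(holdout_days[customer_id]),
            "duration_holdout": (obs_end - cal_end).days,
        }
    return summary


summary = calibration_holdout_summary(
    transactions, calibration_period_end, observation_period_end
)
for customer_id, row in sorted(summary.items()):
    print(customer_id, row)
```

Customer "c" first buys after the calibration period ends, so under the assumption above it does not appear in the result.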
Get expected and actual repeated cumulative transactions.
Uses the expected_number_of_purchases_up_to_time() method from the fitted model
to predict the cumulative number of purchases.
This function follows the formulation on page 8 of [1]_.
In more detail, for each date we take only the customers who made their first
transaction before that date and multiply their number by
expected_number_of_purchases_up_to_time() evaluated over the time since that
first transaction. Doing this for all dates and summing the results gives the
complete cumulative purchases.
Parameters:
model – A fitted BTYD model
transactions – a Pandas DataFrame containing the transactions history of the customer_id
datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.
customer_id_col (string) – the column in transactions that denotes the customer_id
t (int) – the number of time units since the beginning of
data for which we want to calculate cumulative transactions
datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t
understand the provided format.
freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency & T. Example:
With freq='D' and freq_multiplier=1, we get recency=591 and T=632
With freq='h' and freq_multiplier=24, we get recency=590.125 and T=631.375
set_index_date (bool, optional) – when True, the dates are used as the index of the returned Pandas DataFrame; when False (the default), the index is the number of time units.
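The summation over first-purchase cohorts can be sketched as follows. This is a simplified illustration, not the library code: toy_expected_purchases is a made-up stand-in for a fitted model's expected_number_of_purchases_up_to_time(), time is measured in whole periods, and the exact date and off-by-one handling in the package may differ.

```python
from collections import Counter


def expected_cumulative_transactions(first_purchase_periods, expected_purchases, t):
    """Expected cumulative repeat transactions at the end of periods 1..t.

    first_purchase_periods: period index (0-based) of each customer's first purchase.
    expected_purchases: callable returning the expected number of repeat
        purchases within `age` periods of a customer's first purchase.
    """
    cohort_sizes = Counter(first_purchase_periods)
    cumulative = []
    for now in range(1, t + 1):
        total = sum(
            n_new * expected_purchases(now - born)
            for born, n_new in cohort_sizes.items()
            if born < now  # only customers already acquired before `now`
        )
        cumulative.append(total)
    return cumulative


def toy_expected_purchases(age):
    """Hypothetical stand-in for a fitted model: half a purchase per period."""
    return 0.5 * age


# Two customers acquired in period 0, one in period 1, one in period 3.
result = expected_cumulative_transactions([0, 0, 1, 3], toy_expected_purchases, t=4)
print(result)
```

Each cohort contributes its size times the expected purchases for its age, and the contributions are added up for every period.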
This transforms a DataFrame of transaction data of the form:
customer_id, datetime [, monetary_value]
to a DataFrame of the form:
customer_id, frequency, recency, T [, monetary_value]
Parameters:
transactions – a Pandas DataFrame that contains the customer_id col and the datetime col.
customer_id_col (string) – the column in transactions DataFrame that denotes the customer_id
datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.
monetary_value_col (string, optional) – the column in transactions that denotes the monetary value of the transaction.
Optional, only needed for customer lifetime value estimation models.
observation_period_end (datetime, optional) – a string or datetime to denote the final date of the study.
Events after this date are truncated. If not given, defaults to the max ‘datetime_col’.
datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t understand
the provided format.
freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency & T. Example:
With freq='D' and freq_multiplier=1, we get recency=591 and T=632
With freq='h' and freq_multiplier=24, we get recency=590.125 and T=631.375
include_first_transaction (bool, optional) – Default: False
By default the first transaction is not included while calculating frequency and
monetary_value. Can be set to True to include it.
Should be False if you are going to use this data with any of the fitters in the BTYD package.
Returns:
customer_id, frequency, recency, T [, monetary_value]
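A minimal standard-library sketch of this transformation is shown below, including the effect of measuring time in hours and dividing by 24, which is what freq='h' with freq_multiplier=24 is described as doing. Under that reading, a recency of 14,163 hours corresponds to the 590.125 quoted in the parameter description. The helper rfm_summary, its units_per_period argument, and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def _hours(delta):
    """Length of a timedelta in hours."""
    return delta.total_seconds() / 3600


def rfm_summary(rows, observation_period_end, units_per_period=24):
    """Turn (customer_id, datetime) rows into frequency, recency and T.

    Time is measured in hours and divided by units_per_period, so the
    default of 24 yields fractional days.
    """
    stamps_by_customer = defaultdict(set)
    for customer_id, stamp in rows:
        if stamp <= observation_period_end:  # later events are truncated
            hour = stamp.replace(minute=0, second=0, microsecond=0)
            stamps_by_customer[customer_id].add(hour)

    summary = {}
    for customer_id, stamps in stamps_by_customer.items():
        first, last = min(stamps), max(stamps)
        summary[customer_id] = {
            "frequency": len(stamps) - 1,  # first transaction excluded
            "recency": _hours(last - first) / units_per_period,
            "T": _hours(observation_period_end - first) / units_per_period,
        }
    return summary


rows = [
    ("a", datetime(2024, 1, 1, 9)),
    ("a", datetime(2024, 1, 2, 9)),
    ("a", datetime(2024, 1, 3, 12)),
    ("b", datetime(2024, 1, 5, 18)),
]
rfm = rfm_summary(rows, observation_period_end=datetime(2024, 1, 10))
for customer_id, row in sorted(rfm.items()):
    print(customer_id, row)
```

With whole-day resolution customer "a" would get recency 2 and T 9; the hourly version keeps the fractional part.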
Also known as the Beta-Geometric/Beta-Binomial Model [1]_.
Future purchase opportunities are treated as discrete points in time.
In the literature, the model provides a better fit than the Pareto/NBD
model for a nonprofit organization with regular giving patterns.
The model is estimated with a recency-frequency matrix with n transaction
opportunities.
Parameters:
penalizer_coef (float) – The coefficient applied to an l2 norm on the parameters
Conditional expected purchases in future time period.
The expected number of future transactions across the next m_periods_in_future
transaction opportunities by a customer with purchase history
(x, tx, n).
frequency (array_like) – Total periods with observed transactions
recency (array_like) – Period of most recent transaction
n_periods (array_like) – Number of transaction opportunities. Previously called n.
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
log-likelihood, the log-likelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern. Previously called n_custs.
verbose (boolean, optional) – Set to true to print out convergence diagnostics.
tol (float, optional) – Tolerance for termination of the function minimization process.
index (array_like, optional) – Index for the resulting DataFrame, which is accessible via self.data
kwargs – Keyword arguments to pass to the scipy.optimize.minimize
function as options dict
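The role of weights can be illustrated with a small sketch: identical purchase patterns are collapsed into one row with a count, and the weighted sum of per-pattern log-likelihoods equals the sum over individual customers. The function toy_log_likelihood is a deliberately simple stand-in (independent Bernoulli purchases with no dropout), not the model's actual likelihood, and the customer tuples are invented.

```python
import math
from collections import Counter

# Hypothetical individual histories: (frequency, recency, n_periods).
customers = [(2, 5, 6), (0, 0, 6), (2, 5, 6), (0, 0, 6), (0, 0, 6), (1, 3, 6)]

# Condense into observed patterns; the counts play the role of `weights`.
patterns = Counter(customers)


def toy_log_likelihood(frequency, recency, n_periods, p=0.3):
    """Stand-in log-likelihood (NOT the BG/BB one): Bernoulli purchases, no dropout."""
    return frequency * math.log(p) + (n_periods - frequency) * math.log(1 - p)


# One term per customer...
individual_total = sum(toy_log_likelihood(*c) for c in customers)
# ...versus one term per pattern, multiplied by its weight.
weighted_total = sum(w * toy_log_likelihood(*pat) for pat, w in patterns.items())

print(dict(patterns))
print(individual_total, weighted_total)
```

The two totals agree up to floating-point rounding, while the weighted version evaluates the likelihood only once per distinct pattern.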
Also known as the BG/NBD model.
Based on [2] and [3], this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution
respectively.
3) An individual's purchases follow a Poisson process with rate lambda_i, so
the expected number of purchases in time t is lambda_i*t.
4) After each purchase, an individual has a p_i probability of dying
(never buying again).
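The assumptions above can be turned into a small simulation of one customer's repeat purchases. This is only a sketch under stated assumptions: the function name simulate_bgnbd_customer and the parameter values are illustrative, the Gamma distribution is taken to have shape r and rate alpha, and covariates are ignored.

```python
import random


def simulate_bgnbd_customer(r, alpha, a, b, T, rng):
    """Simulate repeat-purchase times in (0, T] for one customer.

    lambda_i ~ Gamma(shape=r, rate=alpha) and p_i ~ Beta(a, b).
    Time 0 is the customer's first purchase, which is not recorded here.
    """
    lam = rng.gammavariate(r, 1.0 / alpha)  # gammavariate takes (shape, scale)
    p = rng.betavariate(a, b)
    times = []
    clock = 0.0
    while True:
        clock += rng.expovariate(lam)  # exponential gaps give a Poisson process
        if clock > T:
            break                      # observation window ends
        times.append(clock)
        if rng.random() < p:
            break                      # customer "dies" after this purchase
    return lam, p, times


rng = random.Random(0)
lam, p, times = simulate_bgnbd_customer(r=0.25, alpha=4.0, a=0.8, b=2.5, T=39.0, rng=rng)
frequency = len(times)
recency = times[-1] if times else 0.0
print(frequency, recency)
```

The resulting frequency and recency, together with T, form the kind of purchase history (frequency, recency, T) the fitters expect.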
Parameters:
penalizer_coef (float) – The coefficient applied to an l2 norm on the parameters
Conditional expected number of purchases up to time.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
Parameters:
t (array_like) – times to calculate the expectation for.
frequency (array_like) – historical frequency of customer.
recency (array_like) – historical recency of customer.
T (array_like) – age of the customer.
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
Compute conditional probability alive.
Compute the probability that a customer with history
(frequency, recency, T) is currently alive.
From http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf
Parameters:
frequency (float) – historical frequency of customer.
recency (float) – historical recency of customer.
T (float) – age of the customer.
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
ln_exp_max (int) – to what value clip log_div equation
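For the plain BG/NBD model without covariates, the probability alive from the Hardie note cited above can be sketched directly. The function below is an illustration, not the package's method: its name, the parameter values, and the default ln_exp_max=300 are assumptions, and the covariate matrices X_tr and X_do are left out entirely.

```python
import math


def bgnbd_probability_alive(frequency, recency, T, r, alpha, a, b, ln_exp_max=300):
    """P(alive | frequency, recency, T) for the BG/NBD model, no covariates.

    For frequency > 0:
        1 / (1 + a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x))
    A customer with no repeat purchase has had no chance to drop out.
    """
    if frequency == 0:
        return 1.0
    log_div = (r + frequency) * math.log((alpha + T) / (alpha + recency)) + math.log(
        a / (b + frequency - 1)
    )
    # Clip the exponent so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(min(log_div, ln_exp_max)))


params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)  # illustrative values only
recent = bgnbd_probability_alive(frequency=3, recency=30.0, T=32.0, **params)
lapsed = bgnbd_probability_alive(frequency=3, recency=5.0, T=32.0, **params)
print(recent, lapsed)
```

A customer whose last purchase is close to T scores higher than one with the same frequency whose purchases all happened long ago.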
Compute the probability alive matrix.
Parameters:
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
max_frequency (float, optional) – the maximum frequency to plot. Default is max observed frequency.
max_recency (float, optional) – the maximum recency to plot. This also determines the age of the
customer. Default to max observed age.
Returns:
A matrix of the form [t_x: historical recency, x: historical frequency]
Calculate the expected number of repeat purchases up to time t.
Calculate repeat purchases for a randomly chosen individual from the
population.
Parameters:
t (array_like) – times to calculate the expectation for
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
Fit a dataset to the BG/NBD model.
Parameters:
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
loglikelihood, the loglikelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
initial_params (array_like, optional) – set the initial parameters for the fitter.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via self.data
kwargs – keyword arguments to pass to the scipy.optimize.minimize
function as options dict
Returns:
The fitted model, with additional properties like params_ and methods like predict
probability_of_n_purchases_up_to_time(t, n, X_tr, X_do)
Compute the probability of n purchases.
\[P( N(t) = n | \text{model} )\]
where N(t) is the number of repeat purchases a customer makes in t
units of time.
Parameters:
t (float) – number units of time
n (int) – number of purchases
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
Returns:
Probability to have n purchases up to t units of time
Conditional expected number of purchases up to time.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
loglikelihood, the loglikelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
initial_params (array_like, optional) – set the initial parameters for the fitter.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via self.data
kwargs – keyword arguments to pass to the scipy.optimize.minimize
function as options dict
Returns:
The fitted model, with additional properties like params_ and methods like predict
Also known as the BG/NBD model.
Based on [1]_, this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution respectively.
3) An individual's purchases follow a Poisson process with rate \(\lambda_i\), so the expected number of purchases in time t is \(\lambda_i*t\).
4) After each purchase, an individual has a p_i probability of dying (never buying again).
Parameters:
hyperparams (dict) – Dictionary containing hyperparameters for model prior parameter distributions.
frequency (array_like, optional) – a vector containing the customers’ frequencies.
Defaults to the whole set of frequencies used for fitting the model.
monetary_value (array_like, optional) – a vector containing the customers’ monetary values.
Defaults to the whole set of monetary values used for
fitting the model.
Returns:
The conditional expectation of the average profit per transaction
This method computes the average lifetime value for a group of one
or more customers.
Parameters:
transaction_prediction_model (model) – the model used to predict future transactions. The literature uses
Pareto/NBD models, but a different model, such as a beta-geo model, can also be used.
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases (denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
monetary_value (array_like) – the monetary value vector of customer’s purchases
(denoted m in literature).
time (float, optional) – the lifetime expected for the user in months. Default: 12
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
monetary_value (array_like) – the monetary value vector of customer’s purchases
(denoted m in literature).
weights (None or array_like) – Number of customers with given frequency/monetary_value,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/monetary_value. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
loglikelihood, the loglikelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
initial_params (array_like, optional) – set the initial parameters for the fitter.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via self.data
q_constraint (bool, optional) – when q < 1, the population mean is negative,
which leads to negative CLV outputs. If True, we penalize negative values of q to avoid this issue.
kwargs – keyword arguments to pass to the scipy.optimize.minimize
function as options dict
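To see why q < 1 is a problem, it helps to write out the Gamma-Gamma conditional expectation of the average profit per transaction as given by Fader and Hardie: a weighted average of the population mean v*p/(q-1) and the customer's observed average. The sketch below is an illustration with made-up parameter values; the function name and signature are assumptions, not the package's API.

```python
import math


def conditional_expected_average_profit(frequency, monetary_value, p, q, v):
    """Gamma-Gamma conditional expectation of average profit per transaction.

    Weighted average of the population mean v * p / (q - 1) and the
    customer's observed average monetary_value; the weight on the observed
    value grows with the number of repeat transactions.
    """
    population_mean = v * p / (q - 1)
    weight = p * frequency / (p * frequency + q - 1)
    return (1 - weight) * population_mean + weight * monetary_value


# Illustrative parameters only.
p, q, v = 6.0, 4.0, 15.0
no_history = conditional_expected_average_profit(0, 0.0, p, q, v)
two_purchases = conditional_expected_average_profit(2, 50.0, p, q, v)
print(no_history, two_purchases)

# With q < 1 the population mean turns negative.
negative_mean = conditional_expected_average_profit(0, 0.0, p, 0.5, v)
print(negative_mean)
```

A customer with no repeat transactions is assigned the population mean, so a negative population mean feeds straight into negative CLV estimates.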
Base method for running Gamma-Gamma model predictions.
Parameters:
method (str) – Predictive quantity of interest; accepts ‘avg_value’ or ‘clv’.
rfm_df (pandas.DataFrame) – Dataframe containing recency, frequency, monetary value, and time period columns.
sample_posterior (bool) – Flag for sampling from parameter posteriors. Set to ‘True’ to return predictive probability distributions instead of point estimates.
posterior_draws (int) – Number of draws from parameter posteriors.
join_df (bool) – NOT SUPPORTED IN 0.1beta2. Flag to add columns to rfm_df containing predictive outputs.
transaction_prediction_model (btyd.models) – the model used to predict future transactions. The literature uses Pareto/NBD models, but a different model, such as a beta-geo model, can also be used.
time (float, optional) – the lifetime expected for the user in months. Default: 12
Based on [5]_, [6]_, this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution
respectively.
3) An individual's purchases follow a Poisson process with rate \(\lambda_i\), so the expected number of purchases in time t is \(\lambda_i*t\).
4) At the beginning of their lifetime and after each purchase, an
individual has a p_i probability of dying (never buying again).
Parameters:
hyperparams (dict) – Dictionary containing hyperparameters for model prior parameter distributions.
Based on [5]_, [6]_, this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution
respectively.
3) An individual's purchases follow a Poisson process with rate \(\lambda_i\), so the expected number of purchases in time t is \(\lambda_i*t\).
4) At the beginning of their lifetime and after each purchase, an
individual has a p_i probability of dying (never buying again).
Conditional expected number of repeat purchases up to time t.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
See Wagner, U. and Hoppe D. (2008).
Parameters:
t (array_like) – times to calculate the expectation for.
frequency (array_like) – historical frequency of customer.
recency (array_like) – historical recency of customer.
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
log-likelihood, the log-likelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via self.data
kwargs – keyword arguments to pass to the scipy.optimize.minimize
function as options dict
Returns:
The fitted model, with additional properties and methods like params_ and predict
Conditional expected number of purchases up to time.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
log-likelihood, the log-likelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
This currently relies too much on the BTYD.util calibration_and_holdout_data function.
Parameters:
model (BTYD model) – A fitted BTYD model.
calibration_holdout_matrix (pandas DataFrame) – DataFrame from calibration_and_holdout_data function.
kind (str, optional) – the quantity shown on the x-axis. One of:
"frequency_cal": purchases in calibration period,
"recency_cal": age of customer at last purchase,
"T_cal": age of customer at the end of calibration period,
"time_since_last_purchase": time since user made last purchase.
Generate artificial data according to the Beta-Geometric/Beta-Binomial
Model.
You may wonder why we can have frequency = n_periods, when frequency excludes the
first order. When a customer purchases something, they are born, and in the next
period we start asking questions about their alive-ness. So the customer has really
made frequency + 1 purchases and has been observed for n_periods + 1 periods.
Parameters:
N (array_like) – Number of transaction opportunities for new customers.
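A minimal sketch of how such data could be generated for a single customer follows. It assumes the usual BG/BB story: a per-customer purchase probability drawn from Beta(alpha, beta), a per-customer dropout probability drawn from Beta(gamma, delta), and a dropout check at the start of each transaction opportunity. The function name, the parameter values, and the loop structure are illustrative and may differ from the package's generator.

```python
import random


def simulate_bgbb_customer(n_periods, alpha, beta, gamma, delta, rng):
    """Simulate (frequency, recency, n_periods) for one customer.

    Period 0 is the first purchase (the customer's "birth") and is not
    counted; opportunities 1..n_periods are the ones summarised.
    """
    p = rng.betavariate(alpha, beta)       # purchase probability while alive
    theta = rng.betavariate(gamma, delta)  # dropout probability per opportunity
    frequency, recency = 0, 0
    for period in range(1, n_periods + 1):
        if rng.random() < theta:
            break                          # customer dies before this opportunity
        if rng.random() < p:
            frequency += 1
            recency = period               # period of most recent transaction
    return frequency, recency, n_periods


rng = random.Random(42)
sample = [
    simulate_bgbb_customer(6, alpha=1.2, beta=0.75, gamma=0.65, delta=2.8, rng=rng)
    for _ in range(200)
]
print(sample[:5])
```

A customer who survives and buys at every opportunity ends up with frequency equal to n_periods, which is the case discussed above.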