Create a summary of each customer over a calibration and holdout period.
This function creates a summary of each customer over a calibration and
holdout period (training and testing, respectively).
It accepts transaction data, and returns a DataFrame of sufficient statistics.
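As a rough illustration of the calibration/holdout split described above, the following sketch partitions a small transaction log at a calibration end date and counts holdout purchases per customer. It uses plain pandas rather than the library function, and the column names `id` and `date`, the dates, and the sample rows are all made up for the example.

```python
import pandas as pd

# Hypothetical transaction log; column names are illustrative only.
transactions = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 3],
        "date": pd.to_datetime(
            ["2021-01-01", "2021-01-10", "2021-02-05",
             "2021-01-03", "2021-02-20", "2021-01-15"]
        ),
    }
)
calibration_period_end = pd.Timestamp("2021-01-31")
observation_period_end = pd.Timestamp("2021-02-28")

# Calibration (training) rows: up to and including the calibration end date.
calibration = transactions[transactions["date"] <= calibration_period_end]

# Holdout (testing) rows: after the calibration end, up to the observation end.
holdout = transactions[
    (transactions["date"] > calibration_period_end)
    & (transactions["date"] <= observation_period_end)
]

# Purchases per customer in the holdout period; customers seen in calibration
# but with no holdout purchases get 0.
frequency_holdout = (
    holdout.groupby("id").size()
    .reindex(calibration["id"].unique(), fill_value=0)
)

# Length of the holdout window in days.
duration_holdout = (observation_period_end - calibration_period_end).days
```

The sketch only shows the split itself; the per-customer calibration statistics (frequency, recency, T) are computed separately.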
transactions – a Pandas DataFrame that contains the customer_id col and the datetime col.
customer_id_col (string) – the column in transactions DataFrame that denotes the customer_id
datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.
calibration_period_end – a period to limit the calibration to, inclusive.
observation_period_end – a string or datetime to denote the final date of the study.
Events after this date are truncated. If not given, defaults to the max ‘datetime_col’.
freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency & T. Example:
With freq=’D’ and freq_multiplier=1, we get recency=591 and T=632
With freq=’h’ and freq_multiplier=24, we get recency=590.125 and T=631.375
datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t understand
the provided format.
monetary_value_col (string, optional) – the column in transactions that denotes the monetary value of the transaction.
Optional, only needed for customer lifetime value estimation models.
include_first_transaction (bool, optional) – Default: False
By default the first transaction is not included while calculating frequency and
monetary_value. Can be set to True to include it.
Should be False if you are going to use this data with any fitters in the BTYD package.
A dataframe with columns frequency_cal, recency_cal, T_cal, frequency_holdout, duration_holdout
If monetary_value_col isn’t None, the dataframe will also have the columns monetary_value_cal and
Get expected and actual repeated cumulative transactions.
Uses the expected_number_of_purchases_up_to_time() method from the fitted model
to predict the cumulative number of purchases.
This function follows the formulation on page 8 of [1]_.
In more detail, we take only the customers who have made their first
transaction before the specific date and then multiply them by the distribution of the
expected_number_of_purchases_up_to_time() for their whole future. Doing that for
all dates and then summing the distributions will give us the complete cumulative
model – A fitted BTYD model
transactions – a Pandas DataFrame containing the transactions history of the customer_id
datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.
customer_id_col (string) – the column in transactions that denotes the customer_id
t (int) – the number of time units since the beginning of
data for which we want to calculate cumulative transactions
datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t
understand the provided format.
freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency & T. Example:
With freq=’D’ and freq_multiplier=1, we get recency=591 and T=632
With freq=’h’ and freq_multiplier=24, we get recency=590.125 and T=631.375
set_index_date (bool, optional) – when True, use the date as the Pandas DataFrame index; when False (the default), use the number of time units.
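The cohort-by-cohort summation described above can be sketched as follows. This is a simplified illustration, not the library's code: `expected_purchases_up_to` is a hypothetical linear stand-in for a fitted model's expected_number_of_purchases_up_to_time() method, and the per-period counts of new customers are invented.

```python
def expected_purchases_up_to(t):
    # Hypothetical stand-in for a fitted model's
    # expected_number_of_purchases_up_to_time(); a real model is non-linear.
    return 0.1 * t


def expected_cumulative_transactions(new_customers_per_period, expected_fn):
    """Cumulative expected repeat transactions at the end of each period.

    new_customers_per_period[d] is the number of customers whose first
    transaction falls in period d.
    """
    cumulative = []
    for t in range(len(new_customers_per_period)):
        total = 0.0
        # Customers born in period d have been active for t - d periods.
        for d in range(t + 1):
            total += new_customers_per_period[d] * expected_fn(t - d)
        cumulative.append(total)
    return cumulative


result = expected_cumulative_transactions([2, 1, 0], expected_purchases_up_to)
```

Each entry of `result` adds up the expected purchases of every cohort that has already made its first transaction, weighted by the size of that cohort.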
This transforms a DataFrame of transaction data of the form:
customer_id, datetime [, monetary_value]
to a DataFrame of the form:
customer_id, frequency, recency, T [, monetary_value]
transactions – a Pandas DataFrame that contains the customer_id col and the datetime col.
customer_id_col (string) – the column in transactions DataFrame that denotes the customer_id
datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.
monetary_value_col (string, optional) – the column in transactions that denotes the monetary value of the transaction.
Optional, only needed for customer lifetime value estimation models.
observation_period_end (datetime, optional) – a string or datetime to denote the final date of the study.
Events after this date are truncated. If not given, defaults to the max ‘datetime_col’.
datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t understand
the provided format.
freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency & T. Example:
With freq=’D’ and freq_multiplier=1, we get recency=591 and T=632
With freq=’h’ and freq_multiplier=24, we get recency=590.125 and T=631.375
include_first_transaction (bool, optional) – Default: False
By default the first transaction is not included while calculating frequency and
monetary_value. Can be set to True to include it.
Should be False if you are going to use this data with any fitters in the BTYD package.
customer_id, frequency, recency, T [, monetary_value]
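To make the three output columns concrete, here is a minimal pandas sketch of the transformation at daily resolution with the first transaction excluded. It is an illustration of the arithmetic, not the library's implementation; the column names `id` and `date` and the sample rows are assumptions, and it assumes at most one purchase per customer per day.

```python
import pandas as pd

# Hypothetical transaction log; column names are illustrative only.
transactions = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 3, 3],
        "date": pd.to_datetime(
            ["2021-01-01", "2021-01-11", "2021-01-21",
             "2021-01-05", "2021-01-10", "2021-01-30"]
        ),
    }
)
observation_period_end = pd.Timestamp("2021-01-31")

# First purchase, last purchase and number of purchases per customer.
grouped = transactions.groupby("id")["date"].agg(["min", "max", "count"])

summary = pd.DataFrame(
    {
        # Repeat purchases only: the first transaction is not counted.
        "frequency": grouped["count"] - 1,
        # Age of the customer (in days) at their most recent purchase.
        "recency": (grouped["max"] - grouped["min"]).dt.days,
        # Age of the customer (in days) at the end of the observation period.
        "T": (observation_period_end - grouped["min"]).dt.days,
    }
)
```

Measuring the same gaps in hours and dividing by 24 (the freq=’h’, freq_multiplier=24 case described above) would give fractional-day values rather than whole days.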
Also known as the Beta-Geometric/Beta-Binomial Model [1]_.
Future purchase opportunities are treated as discrete points in time.
In the literature, the model provides a better fit than the Pareto/NBD
model for a nonprofit organization with regular giving patterns.
The model is estimated with a recency-frequency matrix with n transaction
penalizer_coef (float) – The coefficient applied to an L2 norm on the parameters
Conditional expected purchases in future time period.
The expected number of future transactions across the next m_periods_in_future
transaction opportunities by a customer with purchase history
(x, tx, n).
frequency (array_like) – Total periods with observed transactions
recency (array_like) – Period of most recent transaction
n_periods (array_like) – Number of transaction opportunities. Previously called n.
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
log-likelihood, the log-likelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern. Previously called n_custs.
verbose (boolean, optional) – Set to true to print out convergence diagnostics.
tol (float, optional) – Tolerance for termination of the function minimization process.
index (array_like, optional) – Index for the resulting DataFrame, which is accessible via
kwargs – Key word arguments to pass to the scipy.optimize.minimize
function as options dict
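The role of weights can be illustrated with a small sketch: identical purchase patterns are collapsed into one row with a count, and the weighted sum of per-pattern log-likelihoods matches the sum over individual customers. The customer triples are invented, and `toy_log_likelihood` is a placeholder, not the BG/BB likelihood.

```python
from collections import Counter
import math

# Hypothetical (frequency, recency, n_periods) triples, one per customer.
customers = [(0, 0, 6), (2, 4, 6), (0, 0, 6), (2, 4, 6), (2, 4, 6), (1, 3, 6)]

# Condense to unique patterns; the counts play the role of `weights`.
pattern_counts = Counter(customers)


def toy_log_likelihood(frequency, recency, n_periods):
    # Placeholder for a model's per-customer log-likelihood (not a real model).
    return -math.log(1.0 + frequency + recency + n_periods)


# Sum over every customer individually.
individual_total = sum(toy_log_likelihood(*c) for c in customers)

# Sum over unique patterns, each multiplied by its customer count.
weighted_total = sum(
    weight * toy_log_likelihood(*pattern)
    for pattern, weight in pattern_counts.items()
)
```

Both totals agree up to floating-point rounding, which is why the condensed form is a cheaper way to evaluate the same likelihood.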
Also known as the BG/NBD model.
Based on [2] and [3], this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution respectively.
3) Individuals’ purchases follow a Poisson process with rate lambda_i*t.
4) After each purchase, an individual has a p_i probability of dying
(never buying again).
penalizer_coef (float) – The coefficient applied to an L2 norm on the parameters
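A minimal sketch of how such a penalty typically enters a fitting objective is shown below. The function name `penalized_objective` is hypothetical, and the use of the squared norm is an assumption, since the text does not say whether the norm or its square is penalized.

```python
def penalized_objective(negative_log_likelihood, params, penalizer_coef):
    # Squared L2 norm of the parameters, scaled by the penalizer coefficient.
    # The squared form is an assumption made for this sketch.
    penalty = penalizer_coef * sum(p ** 2 for p in params)
    return negative_log_likelihood + penalty


unpenalized = penalized_objective(10.0, [1.0, 2.0], penalizer_coef=0.0)
penalized = penalized_objective(10.0, [1.0, 2.0], penalizer_coef=0.1)
```

With a coefficient of zero the objective is unchanged; larger coefficients pull the fitted parameters toward zero.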
Conditional expected number of purchases up to time.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
t (array_like) – times to calculate the expectation for.
frequency (array_like) – historical frequency of customer.
recency (array_like) – historical recency of customer.
T (array_like) – age of the customer.
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
Compute conditional probability alive.
Compute the probability that a customer with history
(frequency, recency, T) is currently alive.
frequency (float) – historical frequency of customer.
recency (float) – historical recency of customer.
T (float) – age of the customer.
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
ln_exp_max (int) – the value at which to clip the log_div term before exponentiating
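For orientation, the sketch below evaluates the standard BG/NBD expression for the probability of being alive, without covariates. The parameter names r, alpha, a and b follow the usual BG/NBD notation in the literature and do not appear in the text above; the default of 300 for ln_exp_max is an assumption, and how the covariates X_tr and X_do enter the calculation is not shown.

```python
import math


def conditional_probability_alive(frequency, recency, T, r, alpha, a, b,
                                  ln_exp_max=300):
    """P(alive | frequency, recency, T) under the BG/NBD model, no covariates."""
    if frequency == 0:
        # With no repeat purchase there has been no opportunity to drop out.
        return 1.0
    log_div = (
        (r + frequency) * math.log((alpha + T) / (alpha + recency))
        + math.log(a / (b + frequency - 1))
    )
    # Clip the exponent so math.exp cannot overflow.
    log_div = min(log_div, ln_exp_max)
    return 1.0 / (1.0 + math.exp(log_div))


# Illustrative parameter values only.
params = dict(r=0.25, alpha=4.4, a=0.8, b=2.4)
p_recent = conditional_probability_alive(2, 10, 30, **params)
p_stale = conditional_probability_alive(2, 10, 60, **params)
```

A customer whose last purchase lies further in the past relative to their age (`p_stale`) receives a lower probability of being alive than one with the same history observed over a shorter window (`p_recent`).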
Compute the probability alive matrix.
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
max_frequency (float, optional) – the maximum frequency to plot. Default is max observed frequency.
max_recency (float, optional) – the maximum recency to plot. This also determines the age of the
customer. Default to max observed age.
A matrix of the form [t_x: historical recency, x: historical frequency]
Calculate the expected number of repeat purchases up to time t.
Calculate repeat purchases for a randomly chosen individual from the
t (array_like) – times to calculate the expectation for.
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
Fit a dataset to the BG/NBD model.
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
loglikelihood, the loglikelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
initial_params (array_like, optional) – set the initial parameters for the fitter.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via
kwargs – key word arguments to pass to the scipy.optimize.minimize
function as options dict
with additional properties like params_ and methods like predict
probability_of_n_purchases_up_to_time(t, n, X_tr, X_do)
Compute the probability of n purchases.
\[P( N(t) = n | \text{model} )\]
where N(t) is the number of repeat purchases a customer makes in t
units of time.
t (float) – number of units of time
n (int) – number of purchases
X_tr (array_like) – n * d1 matrix containing covariates representing
time-invariant user characteristics affecting the Transaction Rate.
d1 as number of covariates and n as number of users.
X_do (array_like) – n * d2 matrix containing covariates representing
time-invariant user characteristics affecting the Drop Out.
d2 as number of covariates and n as number of users.
Probability of n purchases up to t units of time
Conditional expected number of purchases up to time.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
loglikelihood, the loglikelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
initial_params (array_like, optional) – set the initial parameters for the fitter.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via
kwargs – key word arguments to pass to the scipy.optimize.minimize
function as options dict
with additional properties like params_ and methods like predict
Also known as the BG/NBD model.
Based on [1]_, this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution respectively.
3) Individuals’ purchases follow a Poisson process with rate \(\lambda_i*t\).
4) After each purchase, an individual has a p_i probability of dying (never buying again).
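The four assumptions above can be read as a recipe for simulating a single customer. The sketch below follows them step by step using only the standard library; the population parameter names r, alpha, a and b are the conventional BG/NBD symbols rather than names taken from this text, and the numeric values are arbitrary.

```python
import random


def simulate_customer(T, r, alpha, a, b, rng):
    """Simulate one customer's repeat purchases over an observation window T."""
    # Assumptions 1-2: hidden per-customer parameters drawn from the population.
    lam = rng.gammavariate(r, 1.0 / alpha)  # shape r, scale 1 / alpha
    p = rng.betavariate(a, b)

    purchase_times = []
    t = 0.0
    alive = True
    while alive:
        # Assumption 3: exponential waiting times give a Poisson process.
        t += rng.expovariate(lam)
        if t > T:
            break
        purchase_times.append(t)
        # Assumption 4: after each purchase the customer drops out with
        # probability p.
        alive = rng.random() >= p

    frequency = len(purchase_times)
    recency = purchase_times[-1] if purchase_times else 0.0
    return frequency, recency, T


rng = random.Random(42)
frequency, recency, T = simulate_customer(
    T=30.0, r=2.0, alpha=10.0, a=1.0, b=3.0, rng=rng
)
```

The returned triple has the same shape as the (frequency, recency, T) history used throughout this reference, with the first purchase at time zero left uncounted.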
hyperparams (dict) – Dictionary containing hyperparameters for model prior parameter distributions.
frequency (array_like, optional) – a vector containing the customers’ frequencies.
Defaults to the whole set of frequencies used for fitting the model.
monetary_value (array_like, optional) – a vector containing the customers’ monetary values.
Defaults to the whole set of monetary values used for
fitting the model.
The conditional expectation of the average profit per transaction
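As a point of reference, the Gamma-Gamma conditional expectation from the literature is a weighted average of the population mean and the customer's own observed average. The sketch below uses point values for the parameters p, q and v, which are the conventional symbols and are assumptions here (only q is mentioned elsewhere in this reference); it does not reproduce how the library handles priors or posterior draws.

```python
def conditional_expected_average_profit(frequency, monetary_value, p, q, v):
    # Population mean transaction value; negative when q < 1.
    population_mean = v * p / (q - 1)
    # Weight on the customer's own observed average grows with frequency.
    individual_weight = p * frequency / (p * frequency + q - 1)
    return (
        (1 - individual_weight) * population_mean
        + individual_weight * monetary_value
    )


# Illustrative parameter values only.
p, q, v = 6.25, 3.74, 15.44
no_history = conditional_expected_average_profit(0, 0.0, p, q, v)
with_history = conditional_expected_average_profit(5, 50.0, p, q, v)
```

A customer with no repeat purchases is assigned the population mean, while a customer with several purchases is pulled most of the way toward their own average.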
This method computes the average lifetime value for a group of one
or more customers.
transaction_prediction_model (model) – the model to predict future transactions. The literature uses
Pareto/NBD models, but a different model such as a beta-geo model can also be used.
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
monetary_value (array_like) – the monetary value vector of customers’ purchases
(denoted m in literature).
time (float, optional) – the lifetime expected for the user in months. Default: 12
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
monetary_value (array_like) – the monetary value vector of customers’ purchases
(denoted m in literature).
weights (None or array_like) – Number of customers with given frequency/monetary_value,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/monetary_value. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
loglikelihood, the loglikelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
initial_params (array_like, optional) – set the initial parameters for the fitter.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via
q_constraint (bool, optional) – when q < 1, the population mean is negative,
which leads to negative CLV outputs. If True, we penalize negative values of q to avoid this issue.
kwargs – key word arguments to pass to the scipy.optimize.minimize
function as options dict
Base method for running Gamma-Gamma model predictions.
method (str) – Predictive quantity of interest; accepts ‘avg_value’ or ‘clv’.
rfm_df (pandas.DataFrame) – Dataframe containing recency, frequency, monetary value, and time period columns.
sample_posterior (bool) – Flag for sampling from parameter posteriors. Set to ‘True’ to return predictive probability distributions instead of point estimates.
posterior_draws (int) – Number of draws from parameter posteriors.
join_df (bool) – NOT SUPPORTED IN 0.1beta2. Flag to add columns to rfm_df containing predictive outputs.
transaction_prediction_model (btyd.models) – the model to predict future transactions. The literature uses Pareto/NBD models, but a different model such as a beta-geo model can also be used.
time (float, optional) – the lifetime expected for the user in months. Default: 12
Based on [5]_, [6]_, this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution respectively.
3) Individuals’ purchases follow a Poisson process with rate \(\lambda_i*t\).
4) At the beginning of their lifetime and after each purchase, an
individual has a p_i probability of dying (never buying again).
hyperparams (dict) – Dictionary containing hyperparameters for model prior parameter distributions.
Based on [5]_, [6]_, this model has the following assumptions:
1) Each individual, i, has a hidden lambda_i and p_i parameter
2) These come from a population wide Gamma and a Beta distribution respectively.
3) Individuals’ purchases follow a Poisson process with rate \(\lambda_i*t\).
4) At the beginning of their lifetime and after each purchase, an
individual has a p_i probability of dying (never buying again).
Conditional expected number of repeat purchases up to time t.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
See Wagner, U. and Hoppe D. (2008).
t (array_like) – times to calculate the expectation for.
frequency (array_like) – historical frequency of customer.
recency (array_like) – historical recency of customer.
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
log-likelihood, the log-likelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
verbose (bool, optional) – set to true to print out convergence diagnostics.
tol (float, optional) – tolerance for termination of the function minimization process.
index (array_like, optional) – index for the resulting DataFrame, which is accessible via
kwargs – key word arguments to pass to the scipy.optimize.minimize
function as options dict
With additional properties and methods like params_ and predict
Conditional expected number of purchases up to time.
Calculate the expected number of repeat purchases up to time t for a
randomly chosen individual from the population, given they have
purchase history (frequency, recency, T).
frequency (array_like) – the frequency vector of customers’ purchases
(denoted x in literature).
recency (array_like) – the recency vector of customers’ purchases
(denoted t_x in literature).
T (array_like) – customers’ age (time units since first purchase)
weights (None or array_like) – Number of customers with given frequency/recency/T,
defaults to 1 if not specified. Fader and
Hardie condense the individual RFM matrix into all
observed combinations of frequency/recency/T. This
parameter represents the count of customers with a given
purchase pattern. Instead of calculating individual
log-likelihood, the log-likelihood is calculated for each
pattern and multiplied by the number of customers with
that pattern.
This currently relies too much on the BTYD.util calibration_and_holdout_data function.
model (BTYD model) – A fitted BTYD model.
calibration_holdout_matrix (pandas DataFrame) – DataFrame from calibration_and_holdout_data function.
kind (str, optional) –
x-axis: “frequency_cal”. Purchases in calibration period,
“recency_cal”. Age of customer at last purchase,
“T_cal”. Age of customer at the end of calibration period,
“time_since_last_purchase”. Time since user made last purchase
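The relationship between these x-axis options can be sketched with pandas. The example below derives time_since_last_purchase from the two ages listed above and averages holdout purchases per x-axis value; the sample DataFrame is invented, and the grouping step is an assumption about how such a comparison is typically summarized, not a description of the plotting function's internals.

```python
import pandas as pd

# Hypothetical output in the shape described for calibration_and_holdout_data.
calibration_holdout_matrix = pd.DataFrame(
    {
        "frequency_cal": [0, 2, 2, 5],
        "recency_cal": [0.0, 10.0, 20.0, 28.0],
        "T_cal": [30.0, 30.0, 30.0, 30.0],
        "frequency_holdout": [0, 1, 3, 4],
    }
)

summary = calibration_holdout_matrix.copy()
# Time since last purchase = age at end of calibration - age at last purchase.
summary["time_since_last_purchase"] = summary["T_cal"] - summary["recency_cal"]

kind = "frequency_cal"
# Average holdout purchases for each value of the chosen x-axis column.
holdout_by_kind = summary.groupby(kind)["frequency_holdout"].mean()
```

Changing `kind` to any of the other three column names regroups the same holdout counts along a different axis.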
Generate artificial data according to the Beta-Geometric/Beta-Binomial
You may wonder why we can have frequency = n_periods, when frequency excludes their
first order. When a customer purchases something, they are born, _and in the next
period_ we start asking questions about their alive-ness. So really the customer has
bought frequency + 1 times, and been observed for n_periods + 1 periods.
N (array_like) – Number of transaction opportunities for new customers.