API Reference#

utils module#

btyd.utils.calculate_alive_path(model, transactions, datetime_col, t, freq='D')#

Calculate alive path for plotting alive history of user.

Uses the conditional_probability_alive() method of the model to compute the path.

Parameters:
  • model – A fitted BTYD model

  • transactions (DataFrame) – a Pandas DataFrame containing the transactions history of the customer_id

  • datetime_col (string) – the column in the transactions that denotes the datetime the purchase was made

  • t (array_like) – the number of time units since the birth for which we want to draw the p_alive

  • freq (string, optional) – Default: ‘D’ for days. Possible values listed here: https://numpy.org/devdocs/reference/arrays.datetime.html#datetime-units

Returns:

A pandas Series containing the p_alive as a function of T (age of the customer)

Return type:

obj: Series

btyd.utils.calibration_and_holdout_data(transactions, customer_id_col, datetime_col, calibration_period_end, observation_period_end=None, freq='D', freq_multiplier=1, datetime_format=None, monetary_value_col=None, include_first_transaction=False)#

Create a summary of each customer over a calibration and holdout period.

This function creates a summary of each customer over a calibration and holdout period (training and testing, respectively). It accepts transaction data, and returns a DataFrame of sufficient statistics.

Parameters:
  • transactions – a Pandas DataFrame that contains the customer_id col and the datetime col.

  • customer_id_col (string) – the column in transactions DataFrame that denotes the customer_id

  • datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.

  • calibration_period_end – a period to limit the calibration to, inclusive.

  • observation_period_end – a string or datetime to denote the final date of the study. Events after this date are truncated. If not given, defaults to the max ‘datetime_col’.

  • freq (string, optional) – Default: ‘D’ for days. Possible values listed here: https://numpy.org/devdocs/reference/arrays.datetime.html#datetime-units

  • freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency and T. Example: with freq='D' and freq_multiplier=1, we get recency=591 and T=632; with freq='h' and freq_multiplier=24, we get recency=590.125 and T=631.375.

  • datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t understand the provided format.

  • monetary_value_col (string, optional) – the column in transactions that denotes the monetary value of the transaction. Optional, only needed for customer lifetime value estimation models.

  • include_first_transaction (bool, optional) – Default: False. By default the first transaction is not included when calculating frequency and monetary_value. Can be set to True to include it. Should be False if you are going to use this data with any fitters in the BTYD package.

Returns:

A DataFrame with columns frequency_cal, recency_cal, T_cal, frequency_holdout and duration_holdout. If monetary_value_col isn’t None, the DataFrame will also have the columns monetary_value_cal and monetary_value_holdout.

Return type:

obj: DataFrame
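The calibration/holdout split described above can be pictured without pandas. The following is a minimal sketch, assuming transactions arrive as (customer_id, date) pairs and time is measured in days; the helper name `split_calibration_holdout` is hypothetical and is not part of btyd, and the library's own handling of edge cases may differ.

```python
from datetime import date


def split_calibration_holdout(transactions, calibration_end, observation_end):
    """Hypothetical helper: summarise (customer_id, date) pairs around a split date."""
    calibration, holdout = {}, {}
    for customer_id, day in transactions:
        if day <= calibration_end:            # the calibration end is inclusive
            calibration.setdefault(customer_id, set()).add(day)
        elif day <= observation_end:          # events after the study end are dropped
            holdout.setdefault(customer_id, set()).add(day)

    duration_holdout = (observation_end - calibration_end).days
    summary = {}
    for customer_id, days in calibration.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency_cal": len(days) - 1,   # the first purchase is not counted
            "recency_cal": (last - first).days,
            "T_cal": (calibration_end - first).days,
            "frequency_holdout": len(holdout.get(customer_id, ())),
            "duration_holdout": duration_holdout,
        }
    return summary


transactions = [
    ("a", date(2024, 1, 1)), ("a", date(2024, 1, 11)), ("a", date(2024, 2, 5)),
    ("b", date(2024, 1, 20)), ("b", date(2024, 3, 15)),
]
summary = split_calibration_holdout(
    transactions, date(2024, 1, 31), date(2024, 2, 29)
)
```

In this toy data, customer "b" buys again only after the observation end, so that purchase contributes to neither period.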

btyd.utils.expected_cumulative_transactions(model, transactions, datetime_col, customer_id_col, t, datetime_format=None, freq='D', freq_multiplier=1, set_index_date=False)#

Get expected and actual repeated cumulative transactions.

Uses the expected_number_of_purchases_up_to_time() method from the fitted model to predict the cumulative number of purchases.

This function follows the formulation on page 8 of [1]_.

In more detail, we take only the customers who have made their first transaction before the specific date and then multiply them by the distribution of the expected_number_of_purchases_up_to_time() for their whole future. Doing that for all dates and then summing the distributions will give us the complete cumulative purchases.

Parameters:
  • model – A fitted BTYD model

  • transactions – a Pandas DataFrame containing the transactions history of the customer_id

  • datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.

  • customer_id_col (string) – the column in transactions that denotes the customer_id

  • t (int) – the number of time units since the beginning of the data for which we want to calculate cumulative transactions

  • datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t understand the provided format.

  • freq (string, optional) – Default: ‘D’ for days. Possible values listed here: https://numpy.org/devdocs/reference/arrays.datetime.html#datetime-units

  • freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency and T. Example: with freq='D' and freq_multiplier=1, we get recency=591 and T=632; with freq='h' and freq_multiplier=24, we get recency=590.125 and T=631.375.

  • set_index_date (bool, optional) – Default: False. When True, the returned DataFrame is indexed by date; otherwise it is indexed by the number of time units.

Returns:

A dataframe with columns actual, predicted

Return type:

obj: DataFrame
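The cohort summation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `first_purchase_counts` and `expected_purchases` are hypothetical names, periods are integer time units, and `expected_purchases(age)` stands in for the fitted model's expected_number_of_purchases_up_to_time().

```python
def cumulative_expected_transactions(first_purchase_counts, expected_purchases, t):
    """Sum expected repeat purchases over all cohorts already born at each period.

    first_purchase_counts[d] is the number of customers whose first purchase
    fell in period d; expected_purchases(age) is a stand-in for the model.
    """
    totals = []
    for period in range(1, t + 1):
        total = 0.0
        # Only cohorts born strictly before this period contribute.
        for birth, n_customers in enumerate(first_purchase_counts[:period]):
            total += n_customers * expected_purchases(period - birth)
        totals.append(total)
    return totals


# Hypothetical stand-in: every customer is expected to buy 0.1 times per period.
predicted = cumulative_expected_transactions([2, 1, 0], lambda age: 0.1 * age, t=3)
```

The `predicted` list plays the role of the predicted column; the actual column would come from counting observed repeat transactions up to each period.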

References

A Note on Implementing the Pareto/NBD Model in MATLAB. http://brucehardie.com/notes/008/

btyd.utils.summary_data_from_transaction_data(transactions, customer_id_col, datetime_col, monetary_value_col=None, datetime_format=None, observation_period_end=None, freq='D', freq_multiplier=1, include_first_transaction=False)#

Return summary data from transactions.

This transforms a DataFrame of transaction data of the form:

customer_id, datetime [, monetary_value]

to a DataFrame of the form:

customer_id, frequency, recency, T [, monetary_value]

Parameters:
  • transactions – a Pandas DataFrame that contains the customer_id col and the datetime col.

  • customer_id_col (string) – the column in transactions DataFrame that denotes the customer_id

  • datetime_col (string) – the column in transactions that denotes the datetime the purchase was made.

  • monetary_value_col (string, optional) – the column in transactions that denotes the monetary value of the transaction. Optional, only needed for customer lifetime value estimation models.

  • observation_period_end (datetime, optional) – a string or datetime to denote the final date of the study. Events after this date are truncated. If not given, defaults to the max ‘datetime_col’.

  • datetime_format (string, optional) – a string that represents the timestamp format. Useful if Pandas can’t understand the provided format.

  • freq (string, optional) – Default: ‘D’ for days. Possible values listed here: https://numpy.org/devdocs/reference/arrays.datetime.html#datetime-units

  • freq_multiplier (int, optional) – Default: 1. Useful for getting exact recency and T. Example: with freq='D' and freq_multiplier=1, we get recency=591 and T=632; with freq='h' and freq_multiplier=24, we get recency=590.125 and T=631.375.

  • include_first_transaction (bool, optional) – Default: False. By default the first transaction is not included when calculating frequency and monetary_value. Can be set to True to include it. Should be False if you are going to use this data with any fitters in the BTYD package.

Returns:

customer_id, frequency, recency, T [, monetary_value]

Return type:

obj: DataFrame
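As a rough illustration of the transformation, the sketch below reduces (customer_id, date, amount) rows to frequency, recency, T and monetary_value in days, with the first transaction excluded as in the default include_first_transaction=False. The helper `summarise_transactions` is hypothetical; treating same-day purchases as one transaction and averaging the per-day totals of repeat purchases are assumptions of this sketch.

```python
from datetime import date


def summarise_transactions(transactions, observation_end):
    """Hypothetical helper: (customer_id, day, amount) rows -> per-customer summary."""
    daily_totals = {}
    for customer_id, day, amount in transactions:
        if day > observation_end:             # events after the study end are truncated
            continue
        per_day = daily_totals.setdefault(customer_id, {})
        per_day[day] = per_day.get(day, 0.0) + amount  # same-day purchases count once

    summary = {}
    for customer_id, per_day in daily_totals.items():
        days = sorted(per_day)
        first, repeats = days[0], days[1:]
        summary[customer_id] = {
            "frequency": len(repeats),        # repeat purchase days only
            "recency": (days[-1] - first).days,
            "T": (observation_end - first).days,
            "monetary_value": (
                sum(per_day[d] for d in repeats) / len(repeats) if repeats else 0.0
            ),
        }
    return summary


rows = [
    ("a", date(2024, 1, 1), 20.0), ("a", date(2024, 1, 11), 30.0),
    ("a", date(2024, 1, 11), 10.0), ("a", date(2024, 2, 5), 50.0),
    ("b", date(2024, 1, 20), 15.0),
]
rfm = summarise_transactions(rows, observation_end=date(2024, 2, 29))
```

A customer with a single purchase, such as "b" here, ends up with frequency 0 and recency 0, while T still measures the time since that first purchase.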

class btyd.BetaGeoBetaBinomFitter(penalizer_coef=0.0)#

Also known as the Beta-Geometric/Beta-Binomial Model [1]_.

Future purchase opportunities are treated as discrete points in time. In the literature, the model provides a better fit than the Pareto/NBD model for a nonprofit organization with regular giving patterns.

The model is estimated with a recency-frequency matrix with n transaction opportunities.

Parameters:

penalizer_coef (float) – The coefficient applied to an l2 norm on the parameters

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:
obj:

Series

data#

A DataFrame with the values given in the call to fit

Type:
obj:

DataFrame

variance_matrix_#

A DataFrame with the variance matrix of the parameters.

Type:
obj:

DataFrame

confidence_intervals_#

A DataFrame with the 95% confidence intervals of the parameters

Type:
obj:

DataFrame

standard_errors_#

A Series with the standard errors of the parameters

Type:
obj:

Series

summary#

A DataFrame containing information about the fitted parameters

Type:
obj:

DataFrame

References

conditional_expected_number_of_purchases_up_to_time(m_periods_in_future, frequency, recency, n_periods)#

Conditional expected purchases in future time period.

The expected number of future transactions across the next m_periods_in_future transaction opportunities by a customer with purchase history (x, tx, n).

\[E\left(X(n\_periods,\ n\_periods + m\_periods\_in\_future) \mid \alpha, \beta, \gamma, \delta, frequency, recency, n\_periods\right)\]

See (13) in Fader & Hardie 2010.

Parameters:

t (array_like) – time n_periods (n+t)

Returns:

predicted transactions

Return type:

array_like

conditional_probability_alive(m_periods_in_future, frequency, recency, n_periods)#

Conditional probability alive.

Conditional probability customer is alive at transaction opportunity n_periods + m_periods_in_future.

\[P(\text{alive at } n\_periods + m\_periods\_in\_future \mid \alpha, \beta, \gamma, \delta, frequency, recency, n\_periods)\]

See (A10) in Fader and Hardie 2010.

Parameters:

m_periods_in_future (array_like) – transaction opportunities

Returns:

alive probabilities

Return type:

array_like

expected_number_of_transactions_in_first_n_periods(n)#

Return expected number of transactions in the first n periods.

Expected number of transactions occurring across first n transaction opportunities. Used by Fader and Hardie to assess in-sample fit.

\[Pr(X(n) = x| \alpha, \beta, \gamma, \delta)\]

See (7) in Fader & Hardie 2010.

Parameters:

n (float) – number of transaction opportunities

Returns:

Predicted values, indexed by x

Return type:

DataFrame

fit(frequency, recency, n_periods, weights=None, initial_params=None, verbose=False, tol=1e-07, index=None, **kwargs)#

Fit the BG/BB model.

Parameters:
  • frequency (array_like) – Total periods with observed transactions

  • recency (array_like) – Period of most recent transaction

  • n_periods (array_like) – Number of transaction opportunities. Previously called n.

  • weights (None or array_like) – Number of customers with given frequency/recency/T, defaults to 1 if not specified. Fader and Hardie condense the individual RFM matrix into all observed combinations of frequency/recency/T. This parameter represents the count of customers with a given purchase pattern. Instead of calculating individual log-likelihood, the log-likelihood is calculated for each pattern and multiplied by the number of customers with that pattern. Previously called n_custs.

  • verbose (boolean, optional) – Set to true to print out convergence diagnostics.

  • tol (float, optional) – Tolerance for termination of the function minimization process.

  • index (array_like, optional) – Index for resulted DataFrame which is accessible via self.data

  • kwargs – Key word arguments to pass to the scipy.optimize.minimize function as options dict

Returns:

fitted and with parameters estimated

Return type:

BetaGeoBetaBinomFitter
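The weights argument can be illustrated by condensing individual (frequency, recency, n_periods) rows into distinct patterns with counts. This is a sketch with made-up data; `toy_log_likelihood` is a hypothetical stand-in and not the BG/BB likelihood. It only demonstrates that weighting each pattern by its count gives the same total as summing over individuals.

```python
import math
from collections import Counter

# One (frequency, recency, n_periods) tuple per customer; values are illustrative.
customers = [(3, 5, 6), (3, 5, 6), (0, 0, 6), (1, 2, 6), (0, 0, 6), (0, 0, 6)]


def toy_log_likelihood(x, t_x, n):
    """Hypothetical stand-in for a per-pattern log-likelihood."""
    return -(1 + x + t_x) / n


# Condense to distinct patterns; the count of each pattern becomes its weight.
patterns = sorted(Counter(customers).items())
frequency = [pattern[0] for pattern, _ in patterns]
recency = [pattern[1] for pattern, _ in patterns]
n_periods = [pattern[2] for pattern, _ in patterns]
weights = [count for _, count in patterns]

individual_total = sum(toy_log_likelihood(*c) for c in customers)
weighted_total = sum(w * toy_log_likelihood(*p) for p, w in patterns)
```

The four condensed lists have one entry per distinct pattern and are the shape of input that fit() expects when weights is supplied.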

class btyd.BetaGeoCovarsFitter(penalizer_coef=0.0)#

Also known as the BG/NBD model.

Based on [2] and [3], this model has the following assumptions:

  1. Each individual, i, has a hidden lambda_i and p_i parameter.

  2. These come from a population-wide Gamma and a Beta distribution, respectively.

  3. Individuals' purchases follow a Poisson process with rate lambda_i*t.

  4. After each purchase, an individual has a p_i probability of dying (never buying again).

Parameters:

penalizer_coef (float) – The coefficient applied to an l2 norm on the parameters

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:
obj:

OrderedDict

data#

A DataFrame with the columns given in the call to fit

Type:
obj:

DataFrame

References

conditional_expected_number_of_purchases_up_to_time(t, frequency, recency, T, X_tr, X_do)#

Conditional expected number of purchases up to time.

Calculate the expected number of repeat purchases up to time t for a randomly chosen individual from the population, given they have purchase history (frequency, recency, T).

Parameters:
  • t (array_like) – times to calculate the expectation for.

  • frequency (array_like) – historical frequency of customer.

  • recency (array_like) – historical recency of customer.

  • T (array_like) – age of the customer.

  • X_tr (array_like) – n * d1 matrix containing covariates representing time-invariant user characteristics affecting the Transaction Rate. d1 as number of covariates and n as number of users.

  • X_do (array_like) – n * d2 matrix containing covariates representing time-invariant user characteristics affecting the Drop Out. d2 as number of covariates and n as number of users.

Return type:

array_like

conditional_probability_alive(frequency, recency, T, X_tr, X_do, ln_exp_max=300)#

Compute conditional probability alive.

Compute the probability that a customer with history (frequency, recency, T) is currently alive.

From http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf

Parameters:
  • frequency (float) – historical frequency of customer.

  • recency (float) – historical recency of customer.

  • T (float) – age of the customer.

  • X_tr (array_like) – n * d1 matrix containing covariates representing time-invariant user characteristics affecting the Transaction Rate. d1 as number of covariates and n as number of users.

  • X_do (array_like) – n * d2 matrix containing covariates representing time-invariant user characteristics affecting the Drop Out. d2 as number of covariates and n as number of users.

  • ln_exp_max (int) – the value at which to clip the log_div equation

Returns:

value representing a probability

Return type:

float

conditional_probability_alive_matrix(X_tr, X_do, max_frequency=None, max_recency=None)#

Compute the probability alive matrix.

Parameters:
  • X_tr (array_like) – n * d1 matrix containing covariates representing time-invariant user characteristics affecting the Transaction Rate. d1 as number of covariates and n as number of users.

  • X_do (array_like) – n * d2 matrix containing covariates representing time-invariant user characteristics affecting the Drop Out. d2 as number of covariates and n as number of users.

  • max_frequency (float, optional) – the maximum frequency to plot. Default is max observed frequency.

  • max_recency (float, optional) – the maximum recency to plot. This also determines the age of the customer. Default to max observed age.

Returns:

A matrix of the form [t_x: historical recency, x: historical frequency]

Return type:

matrix

expected_number_of_purchases_up_to_time(t, X_tr, X_do)#

Calculate the expected number of repeat purchases up to time t.

Calculate repeat purchases for a randomly chosen individual from the population.

Parameters:
  • t (array_like) – times to calculate the expectation for

  • X_tr (array_like) – n * d1 matrix containing covariates representing time-invariant user characteristics affecting the Transaction Rate. d1 as number of covariates and n as number of users.

  • X_do (array_like) – n * d2 matrix containing covariates representing time-invariant user characteristics affecting the Drop Out. d2 as number of covariates and n as number of users.

Return type:

array_like

fit(frequency, recency, T, X_tr, X_do, weights=None, iterative_fitting=1, initial_params=None, verbose=False, tol=0.0001, index=None, **kwargs)#

Fit a dataset to the BG/NBD model.

Parameters:
  • frequency (array_like) – the frequency vector of customers’ purchases (denoted x in literature).

  • recency (array_like) – the recency vector of customers’ purchases (denoted t_x in literature).

  • T (array_like) – customers’ age (time units since first purchase)

  • X_tr (array_like) – n * d1 matrix containing covariates representing time-invariant user characteristics affecting the Transaction Rate. d1 as number of covariates and n as number of users.

  • X_do (array_like) – n * d2 matrix containing covariates representing time-invariant user characteristics affecting the Drop Out. d2 as number of covariates and n as number of users.

  • weights (None or array_like) – Number of customers with given frequency/recency/T, defaults to 1 if not specified. Fader and Hardie condense the individual RFM matrix into all observed combinations of frequency/recency/T. This parameter represents the count of customers with a given purchase pattern. Instead of calculating individual loglikelihood, the loglikelihood is calculated for each pattern and multiplied by the number of customers with that pattern.

  • initial_params (array_like, optional) – set the initial parameters for the fitter.

  • verbose (bool, optional) – set to true to print out convergence diagnostics.

  • tol (float, optional) – tolerance for termination of the function minimization process.

  • index (array_like, optional) – index for resulted DataFrame which is accessible via self.data

  • kwargs – key word arguments to pass to the scipy.optimize.minimize function as options dict

Returns:

with additional properties like params_ and methods like predict

Return type:

BetaGeoCovarsFitter

probability_of_n_purchases_up_to_time(t, n, X_tr, X_do)#
Compute the probability of n purchases.
\[P( N(t) = n | \text{model} )\]

where N(t) is the number of repeat purchases a customer makes in t units of time.

Parameters:
  • t (float) – number of units of time

  • n (int) – number of purchases

  • X_tr (array_like) – n * d1 matrix containing covariates representing time-invariant user characteristics affecting the Transaction Rate. d1 as number of covariates and n as number of users.

  • X_do (array_like) – n * d2 matrix containing covariates representing time-invariant user characteristics affecting the Drop Out. d2 as number of covariates and n as number of users.

Returns:

Probability to have n purchases up to t units of time

Return type:

float

class btyd.BetaGeoFitter(penalizer_coef=0.0)#

Also known as the BG/NBD model.

Based on [2]_, this model has the following assumptions:

  1. Each individual, i, has a hidden lambda_i and p_i parameter

  2. These come from a population wide Gamma and a Beta distribution respectively.

  3. Individuals' purchases follow a Poisson process with rate lambda_i*t.

  4. After each purchase, an individual has a p_i probability of dying (never buying again).
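The four assumptions can be made concrete with a small simulation. The sketch below uses only the standard library; the function name `simulate_customer`, the parameter names r, alpha, a, b (Gamma shape/rate and Beta shape parameters) and the numeric values are illustrative assumptions, not fitted estimates or library API.

```python
import random


def simulate_customer(r, alpha, a, b, T, rng):
    """Simulate one customer's repeat purchases over an observation window of length T."""
    lam = rng.gammavariate(r, 1.0 / alpha)  # assumptions 1-2: hidden rate ~ Gamma(r, rate=alpha)
    p = rng.betavariate(a, b)               # assumptions 1-2: hidden dropout prob ~ Beta(a, b)
    frequency, recency, clock, alive = 0, 0.0, 0.0, True
    while alive:
        clock += rng.expovariate(lam)       # assumption 3: Poisson purchasing while alive
        if clock > T:
            break                           # next purchase falls outside the window
        frequency += 1
        recency = clock
        alive = rng.random() >= p           # assumption 4: may die after each purchase
    return frequency, recency, alive


rng = random.Random(0)
simulated = [simulate_customer(0.5, 5.0, 1.0, 2.5, T=52.0, rng=rng) for _ in range(200)]
```

Because a customer can only drop out after a purchase, anyone with frequency 0 is still alive at T in this simulation, which is why the model cannot distinguish an inactive customer from a slow buyer without repeat purchases.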

Parameters:

penalizer_coef (float) – The coefficient applied to an l2 norm on the parameters

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:
obj:

Series

data#

A DataFrame with the values given in the call to fit

Type:
obj:

DataFrame

variance_matrix_#

A DataFrame with the variance matrix of the parameters.

Type:
obj:

DataFrame

confidence_intervals_#

A DataFrame with the 95% confidence intervals of the parameters

Type:
obj:

DataFrame

standard_errors_#

A Series with the standard errors of the parameters

Type:
obj:

Series

summary#

A DataFrame containing information about the fitted parameters

Type:
obj:

DataFrame

References

conditional_expected_number_of_purchases_up_to_time(t, frequency, recency, T)#

Conditional expected number of purchases up to time.

Calculate the expected number of repeat purchases up to time t for a randomly chosen individual from the population, given they have purchase history (frequency, recency, T).

This function uses equation (10) from [2]_.

Parameters:
  • t (array_like) – times to calculate the expectation for.

  • frequency (array_like) – historical frequency of customer.

  • recency (array_like) – historical recency of customer.

  • T (array_like) – age of the customer.

Return type:

array_like

References

[2] Fader, Peter S., Bruce G.S. Hardie, and Ka Lok Lee (2005), “Counting Your Customers the Easy Way: An Alternative to the Pareto/NBD Model,” Marketing Science, 24 (2), 275-84.

conditional_probability_alive(frequency, recency, T)#

Compute conditional probability alive.

Compute the probability that a customer with history (frequency, recency, T) is currently alive.

From http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf

Parameters:
  • frequency (array or scalar) – historical frequency of customer.

  • recency (array or scalar) – historical recency of customer.

  • T (array or scalar) – age of the customer.

Returns:

value representing a probability

Return type:

array
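For intuition, the closed-form expression in the note linked above can be evaluated directly. The sketch below is a plain-Python transcription of that formula for a single customer, not the library's vectorized implementation; the function name and the parameter values r, alpha, a, b are illustrative assumptions.

```python
import math


def bgnbd_probability_alive(r, alpha, a, b, frequency, recency, T):
    """P(alive | frequency, recency, T) for scalar inputs under the BG/NBD model."""
    if frequency == 0:
        return 1.0  # dropout can only happen after a repeat purchase
    ratio = (a / (b + frequency - 1)) * ((alpha + T) / (alpha + recency)) ** (r + frequency)
    return 1.0 / (1.0 + ratio)


params = dict(r=0.243, alpha=4.414, a=0.793, b=2.426)  # illustrative values only
just_bought = bgnbd_probability_alive(frequency=1, recency=10.0, T=10.0, **params)
long_silent = bgnbd_probability_alive(frequency=1, recency=10.0, T=40.0, **params)
never_repeated = bgnbd_probability_alive(frequency=0, recency=0.0, T=40.0, **params)
```

With the same purchase history, a longer gap between the last purchase and T lowers the probability, so `long_silent` is smaller than `just_bought`.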

conditional_probability_alive_matrix(max_frequency=None, max_recency=None)#

Compute the probability alive matrix.

Uses the conditional_probability_alive() method to calculate the matrix.

Parameters:
  • max_frequency (float, optional) – the maximum frequency to plot. Default is max observed frequency.

  • max_recency (float, optional) – the maximum recency to plot. This also determines the age of the customer. Default to max observed age.

Returns:

A matrix of the form [t_x: historical recency, x: historical frequency]

Return type:

matrix

expected_number_of_purchases_up_to_time(t)#

Calculate the expected number of repeat purchases up to time t.

Calculate repeat purchases for a randomly chosen individual from the population.

Equivalent to equation (9) of [2]_.

Parameters:

t (array_like) – times to calculate the expectation for

Return type:

array_like

fit(frequency, recency, T, weights=None, initial_params=None, verbose=False, tol=1e-07, index=None, **kwargs)#

Fit a dataset to the BG/NBD model.

Parameters:
  • frequency (array_like) – the frequency vector of customers’ purchases (denoted x in literature).

  • recency (array_like) – the recency vector of customers’ purchases (denoted t_x in literature).

  • T (array_like) – customers’ age (time units since first purchase)

  • weights (None or array_like) – Number of customers with given frequency/recency/T, defaults to 1 if not specified. Fader and Hardie condense the individual RFM matrix into all observed combinations of frequency/recency/T. This parameter represents the count of customers with a given purchase pattern. Instead of calculating individual loglikelihood, the loglikelihood is calculated for each pattern and multiplied by the number of customers with that pattern.

  • initial_params (array_like, optional) – set the initial parameters for the fitter.

  • verbose (bool, optional) – set to true to print out convergence diagnostics.

  • tol (float, optional) – tolerance for termination of the function minimization process.

  • index (array_like, optional) – index for resulted DataFrame which is accessible via self.data

  • kwargs – key word arguments to pass to the scipy.optimize.minimize function as options dict

Returns:

with additional properties like params_ and methods like predict

Return type:

BetaGeoFitter

probability_of_n_purchases_up_to_time(t, n)#

Compute the probability of n purchases.

\[P( N(t) = n | \text{model} )\]

where N(t) is the number of repeat purchases a customer makes in t units of time.

Comes from equation (8) of [2]_.

Parameters:
  • t (float) – number units of time

  • n (int) – number of purchases

Returns:

Probability to have n purchases up to t units of time

Return type:

float

class btyd.BetaGeoModel(hyperparams: Dict[float] = None)#

Also known as the BG/NBD model.

Based on [1]_, this model has the following assumptions:

  1. Each individual, i, has a hidden lambda_i and p_i parameter.

  2. These come from a population-wide Gamma and a Beta distribution, respectively.

  3. Individuals' purchases follow a Poisson process with rate \(\lambda_i*t\).

  4. After each purchase, an individual has a p_i probability of dying (never buying again).

Parameters:

hyperparams (dict) – Dictionary containing hyperparameters for model prior parameter distributions.

_hyperparams#

Hyperparameters of prior parameter distributions for model fitting.

Type:

dict

_param_list#

List of estimated model parameters.

Type:

list

_model#

Hierarchical Bayesian model to estimate model parameters.

Type:

pymc.Model

_idata#

InferenceData object of the fitted or loaded model. Used for predictions as well as evaluation plots and model metrics via the ArviZ library.

Type:

arviz.InferenceData

References

generate_rfm_data(size: int = 1000) DataFrame#

Generate synthetic RFM data from fitted model parameters. Useful for posterior predictive checks of model performance.

Parameters:

size (int) – Rows of synthetic RFM data to generate. Default is 1000.

Returns:

self.synthetic_df – dataframe containing [“frequency”, “recency”, “T”, “lambda”, “p”, “alive”, “customer_id”] columns.

Return type:

pd.DataFrame

class btyd.GammaGammaFitter(penalizer_coef=0.0)#

Fitter for the gamma-gamma model.

It is used to estimate the average monetary value of customer transactions.

This implementation is based on the Excel spreadsheet found in [3]_. More details on the derivation and evaluation can be found in [4]_.

Parameters:

penalizer_coef (float) – The coefficient applied to an l2 norm on the parameters

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:
obj:

OrderedDict

data#

A DataFrame with the columns given in the call to fit

Type:
obj:

DataFrame

References

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:
obj:

Series

data#

A DataFrame with the values given in the call to fit

Type:
obj:

DataFrame

variance_matrix_#

A DataFrame with the variance matrix of the parameters.

Type:
obj:

DataFrame

confidence_intervals_#

A DataFrame with the 95% confidence intervals of the parameters

Type:
obj:

DataFrame

standard_errors_#

A Series with the standard errors of the parameters

Type:
obj:

Series

summary#

A DataFrame containing information about the fitted parameters

Type:
obj:

DataFrame

conditional_expected_average_profit(frequency=None, monetary_value=None)#

Conditional expectation of the average profit.

This method computes the conditional expectation of the average profit per transaction for a group of one or more customers.

Equation (5) from: http://www.brucehardie.com/notes/025/

Parameters:
  • frequency (array_like, optional) – a vector containing the customers’ frequencies. Defaults to the whole set of frequencies used for fitting the model.

  • monetary_value (array_like, optional) – a vector containing the customers’ monetary values. Defaults to the whole set of monetary values used for fitting the model.

Returns:

The conditional expectation of the average profit per transaction

Return type:

array_like
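The conditional expectation is a weighted average of the population mean and the customer's own observed average, with more weight on the customer's history as frequency grows. The sketch below writes this out for scalars, assuming fitted Gamma-Gamma parameters named p, q and v; the function name and numeric values are illustrative, not library output.

```python
import math


def expected_average_profit(p, q, v, frequency, monetary_value):
    """Shrink a customer's observed mean transaction value toward the population mean."""
    individual_weight = p * frequency / (p * frequency + q - 1)
    population_mean = v * p / (q - 1)
    return (1 - individual_weight) * population_mean + individual_weight * monetary_value


# Illustrative parameters: the population mean is 15 * 6 / (4 - 1) = 30.
repeat_buyer = expected_average_profit(p=6.0, q=4.0, v=15.0, frequency=2, monetary_value=40.0)
new_customer = expected_average_profit(p=6.0, q=4.0, v=15.0, frequency=0, monetary_value=0.0)
```

Here the repeat buyer's estimate lands at roughly 38, between the population mean of 30 and the observed 40, while a customer with no repeat purchases is assigned the population mean.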

customer_lifetime_value(transaction_prediction_model, frequency, recency, T, monetary_value, time=12, discount_rate=0.01, freq='D')#

Return customer lifetime value.

This method computes the average lifetime value for a group of one or more customers.

Parameters:
  • transaction_prediction_model (model) – the model to predict future transactions. The literature uses Pareto/NBD models, but a different model such as a beta-geo model can also be used.

  • frequency (array_like) – the frequency vector of customers’ purchases (denoted x in literature).

  • recency (array_like) – the recency vector of customers’ purchases (denoted t_x in literature).

  • T (array_like) – customers’ age (time units since first purchase)

  • monetary_value (array_like) – the monetary value vector of customer’s purchases (denoted m in literature).

  • time (float, optional) – the lifetime expected for the user in months. Default: 12

  • discount_rate (float, optional) – the monthly adjusted discount rate. Default: 0.01

  • freq (string, optional) – {“D”, “H”, “M”, “W”} for day, hour, month, week. This represents what unit of time your T is measured in.

Returns:

Series object with customer ids as index and the estimated customer lifetime values as values

Return type:

Series
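The roles of time and discount_rate can be sketched as a month-by-month discounted sum. This is an assumption-laden illustration, not the library's implementation: `cumulative_purchases` is a hypothetical stand-in for the transaction model's cumulative expected purchases, and `units_per_month=30` assumes T is measured in days.

```python
import math


def discounted_clv(cumulative_purchases, monetary_value, time=12,
                   discount_rate=0.01, units_per_month=30):
    """Sum each month's expected purchases times value, discounted back monthly."""
    clv = 0.0
    for month in range(1, time + 1):
        end = month * units_per_month
        # Expected purchases that fall inside this month only.
        in_month = cumulative_purchases(end) - cumulative_purchases(end - units_per_month)
        clv += monetary_value * in_month / (1 + discount_rate) ** month
    return clv


# Hypothetical stand-in: one expected purchase every 30 days.
one_per_month = lambda t: t / 30
undiscounted = discounted_clv(one_per_month, monetary_value=10.0, time=2, discount_rate=0.0)
discounted = discounted_clv(one_per_month, monetary_value=10.0, time=2, discount_rate=0.01)
```

Because discounting is applied per month, the discount rate should be a monthly rate even when T is measured in days or weeks.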

fit(frequency, monetary_value, weights=None, initial_params=None, verbose=False, tol=1e-07, index=None, q_constraint=False, **kwargs)#

Fit the data to the Gamma/Gamma model.

Parameters:
  • frequency (array_like) – the frequency vector of customers’ purchases (denoted x in literature).

  • monetary_value (array_like) – the monetary value vector of customer’s purchases (denoted m in literature).

  • weights (None or array_like) – Number of customers with given frequency/monetary_value, defaults to 1 if not specified. Fader and Hardie condense the individual RFM matrix into all observed combinations of frequency/monetary_value. This parameter represents the count of customers with a given purchase pattern. Instead of calculating individual loglikelihood, the loglikelihood is calculated for each pattern and multiplied by the number of customers with that pattern.

  • initial_params (array_like, optional) – set the initial parameters for the fitter.

  • verbose (bool, optional) – set to true to print out convergence diagnostics.

  • tol (float, optional) – tolerance for termination of the function minimization process.

  • index (array_like, optional) – index for resulted DataFrame which is accessible via self.data

  • q_constraint (bool, optional) – when q < 1, population mean will result in a negative value leading to negative CLV outputs. If True, we penalize negative values of q to avoid this issue.

  • kwargs – key word arguments to pass to the scipy.optimize.minimize function as options dict

Returns:

fitted and with parameters estimated

Return type:

GammaGammaFitter

class btyd.GammaGammaModel(hyperparams: Dict[float] = None)#

The Gamma-Gamma model is used to estimate the average monetary value of customer transactions.

This implementation is based on the Excel spreadsheet found in [3]_. More details on the derivation and evaluation can be found in [4]_.

Parameters:

hyperparams (dict) – Dictionary containing hyperparameters for model prior parameter distributions.

_hyperparams#

Hyperparameters of prior parameter distributions for model fitting.

Type:

dict

_param_list#

List of estimated model parameters.

Type:

list

_model#

Hierarchical Bayesian model to estimate model parameters.

Type:

pymc.Model

_idata#

InferenceData object of the fitted or loaded model. Used for predictions as well as evaluation plots and model metrics via the ArviZ library.

Type:

arviz.InferenceData

References

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:
obj:

Series

data#

A DataFrame with the values given in the call to fit

Type:
obj:

DataFrame

variance_matrix_#

A DataFrame with the variance matrix of the parameters.

Type:
obj:

DataFrame

confidence_intervals_#

A DataFrame with the 95% confidence intervals of the parameters

Type:
obj:

DataFrame

standard_errors_#

A Series with the standard errors of the parameters

Type:
obj:

Series

summary#

A DataFrame containing information about the fitted parameters

Type:
obj:

DataFrame

generate_rfm_data()#

Not currently supported for GammaGammaModel.

predict(method: str, rfm_df: pd.DataFrame = None, sample_posterior: bool = False, posterior_draws: int = 100, join_df=False, transaction_prediction_model: btyd.Model = None, time: int = 12, discount_rate: float = 0.01, freq: str = 'D') np.ndarray#

Base method for running Gamma-Gamma model predictions.

Parameters:
  • method (str) – Predictive quantity of interest; accepts ‘avg_value’ or ‘clv’.

  • rfm_df (pandas.DataFrame) – Dataframe containing recency, frequency, monetary value, and time period columns.

  • sample_posterior (bool) – Flag for sampling from parameter posteriors. Set to ‘True’ to return predictive probability distributions instead of point estimates.

  • posterior_draws (int) – Number of draws from parameter posteriors.

  • join_df (bool) – NOT SUPPORTED IN 0.1beta2. Flag to add columns to rfm_df containing predictive outputs.

  • transaction_prediction_model (btyd.models) – the model used to predict future transactions. The literature uses Pareto/NBD models, but a different model, such as a beta-geometric model, can also be used.

  • time (float, optional) – the lifetime expected for the user in months. Default: 12

  • discount_rate (float, optional) – the monthly adjusted discount rate. Default: 0.01

  • freq (string, optional) – {“D”, “H”, “M”, “W”} for day, hour, month, week. This represents what unit of time your T is measured in.

Returns:

predictions – Numpy arrays containing predictive quantities of interest.

Return type:

np.ndarray
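As a rough illustration of how time and discount_rate interact for the ‘clv’ method, the sketch below discounts a stream of monthly expected revenue. It is a generic discounted-cash-flow calculation, not btyd's implementation; discounted_clv and its arguments are hypothetical names, and in practice the per-month purchase counts would come from the transaction_prediction_model.

```python
def discounted_clv(expected_purchases_per_month, avg_value, discount_rate=0.01):
    """Sum monthly expected revenue, discounting month i by (1 + rate) ** i.

    Hypothetical helper for illustration only; not part of btyd.
    """
    total = 0.0
    for month, purchases in enumerate(expected_purchases_per_month, start=1):
        total += (purchases * avg_value) / (1.0 + discount_rate) ** month
    return total


# 12 months, 0.5 expected purchases per month, average order value of 40.0.
clv = discounted_clv([0.5] * 12, avg_value=40.0, discount_rate=0.01)
```

With a discount rate of zero the result reduces to the plain sum of expected monthly revenue.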

class btyd.ModBetaGeoModel(hyperparams: Dict[float] = None)#

Also known as the MBG/NBD model.

Based on [5]_, [6]_, this model has the following assumptions:

  1. Each individual, i, has a hidden lambda_i and p_i parameter.

  2. These come from a population-wide Gamma and a Beta distribution, respectively.

  3. An individual's purchases follow a Poisson process with rate \(\lambda_i*t\).

  4. At the beginning of their lifetime and after each purchase, an individual has a p_i probability of dying (never buying again).
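The assumptions above can be turned into a small simulation of a single customer. This is a minimal sketch using only the Python standard library, not btyd code; simulate_mbg_nbd_customer is a hypothetical name, and the Gamma(shape=r, rate=alpha) and Beta(a, b) parameterization is an assumption based on the usual notation for this model family.

```python
import random


def simulate_mbg_nbd_customer(T, r, alpha, a, b, rng=random):
    """Simulate one customer's repeat purchases over an observation window T.

    Returns (frequency, recency, alive). Hypothetical helper, not btyd code.
    """
    lam = rng.gammavariate(r, 1.0 / alpha)  # purchase rate: shape r, scale 1/alpha
    p = rng.betavariate(a, b)               # dropout probability

    purchase_times = []
    t = 0.0
    # Dropout is possible at time zero, before any repeat purchase.
    alive = rng.random() >= p
    while alive:
        t += rng.expovariate(lam)           # exponential waits give a Poisson process
        if t > T:
            break                           # next purchase falls outside the window
        purchase_times.append(t)
        alive = rng.random() >= p           # dropout chance after each purchase

    frequency = len(purchase_times)
    recency = purchase_times[-1] if purchase_times else 0.0
    return frequency, recency, alive


rng = random.Random(0)
customers = [simulate_mbg_nbd_customer(30.0, 1.0, 2.0, 1.0, 3.0, rng=rng) for _ in range(5)]
```

The dropout check at time zero is what distinguishes this sketch from a plain BG/NBD simulation, where a customer can only drop out after a purchase.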

Parameters:

hyperparams (dict) – Dictionary containing hyperparameters for model prior parameter distributions.

_hyperparams#

Hyperparameters of prior parameter distributions for model fitting.

Type:

dict

_param_list#

List of estimated model parameters.

Type:

list

_model#

Hierarchical Bayesian model to estimate model parameters.

Type:

pymc.Model

_idata#

InferenceData object of the fitted or loaded model. Used for predictions, evaluation plots, and model metrics via the ArviZ library.

Type:

arviz.InferenceData

References

generate_rfm_data(size: int = 1000) DataFrame#

Generate synthetic RFM data from fitted model parameters. Useful for posterior predictive checks of model performance.

Parameters:

size (int) – Rows of synthetic RFM data to generate. Default is 1000.

Returns:

self.synthetic_df – dataframe containing [“frequency”, “recency”, “T”, “lambda”, “p”, “alive”, “customer_id”] columns.

Return type:

pd.DataFrame

class btyd.ModifiedBetaGeoFitter(penalizer_coef=0.0)#

Also known as the MBG/NBD model.

Based on [5]_, [6]_, this model has the following assumptions:

  1. Each individual, i, has a hidden lambda_i and p_i parameter.

  2. These come from a population-wide Gamma and a Beta distribution, respectively.

  3. An individual's purchases follow a Poisson process with rate \(\lambda_i*t\).

  4. At the beginning of their lifetime and after each purchase, an individual has a p_i probability of dying (never buying again).

References

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:

Series

data#

A DataFrame with the values given in the call to fit

Type:

DataFrame

variance_matrix_#

A DataFrame with the variance matrix of the parameters.

Type:

DataFrame

confidence_intervals_#

A DataFrame with the 95% confidence intervals of the parameters

Type:

DataFrame

standard_errors_#

A Series with the standard errors of the parameters

Type:

Series

summary#

A DataFrame containing information about the fitted parameters

Type:

DataFrame

conditional_expected_number_of_purchases_up_to_time(t, frequency, recency, T)#

Conditional expected number of repeat purchases up to time t.

Calculate the expected number of repeat purchases up to time t for a randomly chosen individual from the population, given they have purchase history (frequency, recency, T). See Wagner, U. and Hoppe D. (2008).

Parameters:
  • t (array_like) – times to calculate the expectation for.

  • frequency (array_like) – historical frequency of customer.

  • recency (array_like) – historical recency of customer.

  • T (array_like) – age of the customer.

Return type:

array_like

conditional_probability_alive(frequency, recency, T)#

Conditional probability alive.

Compute the probability that a customer with history (frequency, recency, T) is currently alive. From https://www.researchgate.net/publication/247219660_Empirical_validation_and_comparison_of_models_for_customer_base_analysis Appendix A, eq. (5)

Parameters:
  • frequency (array or float) – historical frequency of customer.

  • recency (array or float) – historical recency of customer.

  • T (array or float) – age of the customer.

Returns:

value representing probability of being alive

Return type:

array
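For orientation, the sketch below evaluates a closed form that is commonly quoted for the MBG/NBD probability of being alive. Treat the formula itself as an assumption to check against the cited appendix, since it is not reproduced in this reference; mbg_nbd_probability_alive is a hypothetical standalone function, and r, alpha, a, b stand for the fitted model parameters.

```python
def mbg_nbd_probability_alive(frequency, recency, T, r, alpha, a, b):
    """Assumed closed form: 1 / (1 + a / (b + x) * ((alpha + T) / (alpha + t_x)) ** (r + x)).

    Hypothetical helper for illustration; verify against the cited paper.
    """
    ratio = (alpha + T) / (alpha + recency)
    return 1.0 / (1.0 + (a / (b + frequency)) * ratio ** (r + frequency))


# A customer who bought 3 times, last at t_x = 20, observed for T = 30 time units.
p_alive = mbg_nbd_probability_alive(3, 20.0, 30.0, r=0.5, alpha=5.0, a=1.0, b=2.0)
```

Under this form, a longer gap between the last purchase and the end of observation (larger T for a fixed recency) lowers the probability.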

expected_number_of_purchases_up_to_time(t)#

Return expected number of repeat purchases up to time t.

Calculate the expected number of repeat purchases up to time t for a randomly chosen individual from the population.

Parameters:

t (array_like) – times to calculate the expectation for

Return type:

array_like

fit(frequency, recency, T, weights=None, initial_params=None, verbose=False, tol=1e-07, index=None, **kwargs)#

Fit the data to the MBG/NBD model.

Parameters:
  • frequency (array_like) – the frequency vector of customers’ purchases (denoted x in literature).

  • recency (array_like) – the recency vector of customers’ purchases (denoted t_x in literature).

  • T (array_like) – customers’ age (time units since first purchase)

  • weights (None or array_like) – Number of customers with given frequency/recency/T, defaults to 1 if not specified. Fader and Hardie condense the individual RFM matrix into all observed combinations of frequency/recency/T. This parameter represents the count of customers with a given purchase pattern. Instead of calculating individual log-likelihood, the log-likelihood is calculated for each pattern and multiplied by the number of customers with that pattern.

  • verbose (bool, optional) – set to true to print out convergence diagnostics.

  • tol (float, optional) – tolerance for termination of the function minimization process.

  • index (array_like, optional) – index for the resulting DataFrame, which is accessible via self.data

  • kwargs – keyword arguments to pass to the scipy.optimize.minimize function as the options dict

Returns:

With additional properties and methods like params_ and predict

Return type:

ModifiedBetaGeoFitter
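The weights argument of fit expects one row per observed (frequency, recency, T) pattern together with the number of customers sharing it. A minimal standard-library sketch of that condensing step follows; condense_rfm is a hypothetical helper, not a btyd function.

```python
from collections import Counter


def condense_rfm(rows):
    """Collapse per-customer (frequency, recency, T) tuples into unique patterns.

    Returns four parallel lists: frequency, recency, T, weights.
    """
    counts = Counter(rows)
    patterns = sorted(counts)
    frequency = [p[0] for p in patterns]
    recency = [p[1] for p in patterns]
    T = [p[2] for p in patterns]
    weights = [counts[p] for p in patterns]
    return frequency, recency, T, weights


rows = [(0, 0.0, 30.0), (2, 12.0, 30.0), (0, 0.0, 30.0), (2, 12.0, 30.0), (1, 5.0, 20.0)]
frequency, recency, T, weights = condense_rfm(rows)
# weights now holds the customer count for each unique pattern.
```

The four lists can then be passed as the frequency, recency, T, and weights arguments, so the log-likelihood is evaluated once per pattern rather than once per customer.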

probability_of_n_purchases_up_to_time(t, n)#

Compute the probability of n purchases up to time t.

\[P( N(t) = n | \text{model} )\]

where N(t) is the number of repeat purchases a customer makes in t units of time.

Parameters:
  • t (float) – number of units of time

  • n (int) – number of purchases

Returns:

Probability to have n purchases up to t units of time

Return type:

float

class btyd.ParetoNBDFitter(penalizer_coef=0.0)#

Pareto NBD fitter [7].

Parameters:

penalizer_coef (float) – The coefficient applied to an l2 norm on the parameters

penalizer_coef#

The coefficient applied to an l2 norm on the parameters

Type:

float

params_#

The fitted parameters of the model

Type:

OrderedDict

data#

A DataFrame with the columns given in the call to fit

Type:

DataFrame

References

conditional_expected_number_of_purchases_up_to_time(t, frequency, recency, T)#

Conditional expected number of purchases up to time.

Calculate the expected number of repeat purchases up to time t for a randomly chosen individual from the population, given they have purchase history (frequency, recency, T).

This is equation (41) from: http://brucehardie.com/notes/009/pareto_nbd_derivations_2005-11-05.pdf

Parameters:
  • t (array_like) – times to calculate the expectation for.

  • frequency (array_like) – historical frequency of customer.

  • recency (array_like) – historical recency of customer.

  • T (array_like) – age of the customer.

Return type:

array_like

conditional_probability_alive(frequency, recency, T)#

Conditional probability alive.

Compute the probability that a customer with history (frequency, recency, T) is currently alive.

Section 5.1 from (equations (36) and (37)): http://brucehardie.com/notes/009/pareto_nbd_derivations_2005-11-05.pdf

Parameters:
  • frequency (float) – historical frequency of customer.

  • recency (float) – historical recency of customer.

  • T (float) – age of the customer.

Returns:

value representing a probability

Return type:

float

conditional_probability_alive_matrix(max_frequency=None, max_recency=None)#

Compute the probability alive matrix.

Builds on the conditional_probability_alive() method.

Parameters:
  • max_frequency (float, optional) – the maximum frequency to plot. Default is max observed frequency.

  • max_recency (float, optional) – the maximum recency to plot. This also determines the age of the customer. Default to max observed age.

Returns:

A matrix of the form [t_x: historical recency, x: historical frequency]

Return type:

matrix
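The shape of the returned matrix can be pictured with a short sketch that evaluates a probability function over a grid, with recency on the rows and frequency on the columns. This assumes an integer grid from 0 to the maximum inclusive and sets the customer age T to max_recency, as the parameter description suggests; probability_alive_matrix and toy_p_alive are hypothetical stand-ins, not the library's code.

```python
def probability_alive_matrix(p_alive, max_frequency, max_recency):
    """Return a nested list indexed as matrix[t_x][x].

    p_alive is any callable taking (frequency, recency, T).
    """
    T = max_recency  # the age of the customer is tied to the maximum recency
    return [
        [p_alive(x, t_x, T) for x in range(max_frequency + 1)]
        for t_x in range(max_recency + 1)
    ]


def toy_p_alive(frequency, recency, T):
    """Placeholder probability that only depends on how recent the last purchase is."""
    return (recency + 1) / (T + 1)


matrix = probability_alive_matrix(toy_p_alive, max_frequency=2, max_recency=3)
```

With a fitted model, the model's own conditional_probability_alive method would take the place of toy_p_alive.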

conditional_probability_of_being_alive_up_to_time(t, frequency, recency, T)#

Conditional probability of being alive up to time T+t.

Compute the probability that a customer with purchase history (frequency, recency, T) is still alive up to time T+t.

From paper: http://www.brucehardie.com/notes/015/additional_pareto_nbd_results.pdf

Parameters:
  • t (int) – time up to which probability should be calculated.

  • frequency (float) – historical frequency of customer.

  • recency (float) – historical recency of customer.

  • T (float) – age of the customer.

Returns:

value representing a probability

Return type:

float

conditional_probability_of_n_purchases_up_to_time(n, t, frequency, recency, T)#

Return conditional probability of n purchases up to time t.

Calculate the probability of n purchases up to time t for an individual with history frequency, recency and T (age).

The main equation being implemented is (16) from: http://www.brucehardie.com/notes/028/pareto_nbd_conditional_pmf.pdf

Parameters:
  • n (int) – number of purchases.

  • t (a scalar) – time up to which probability should be calculated.

  • frequency (float) – historical frequency of customer.

  • recency (float) – historical recency of customer.

  • T (float) – age of the customer.

Return type:

array_like

expected_number_of_purchases_up_to_time(t)#

Return expected number of repeat purchases up to time t.

Calculate the expected number of repeat purchases up to time t for a randomly chosen individual from the population.

Equation (27) from: http://brucehardie.com/notes/009/pareto_nbd_derivations_2005-11-05.pdf

Parameters:

t (array_like) – times to calculate the expectation for.

Return type:

array_like

fit(frequency, recency, T, weights=None, iterative_fitting=1, initial_params=None, verbose=False, tol=0.0001, index=None, fit_method='Nelder-Mead', maxiter=2000, **kwargs)#

Pareto/NBD model fitter.

Parameters:
  • frequency (array_like) – the frequency vector of customers’ purchases (denoted x in literature).

  • recency (array_like) – the recency vector of customers’ purchases (denoted t_x in literature).

  • T (array_like) – customers’ age (time units since first purchase)

  • weights (None or array_like) – Number of customers with given frequency/recency/T, defaults to 1 if not specified. Fader and Hardie condense the individual RFM matrix into all observed combinations of frequency/recency/T. This parameter represents the count of customers with a given purchase pattern. Instead of calculating individual log-likelihood, the log-likelihood is calculated for each pattern and multiplied by the number of customers with that pattern.

  • iterative_fitting (int, optional) – perform iterative_fitting fits over random/warm-started initial params

  • initial_params (array_like, optional) – set the initial parameters for the fitter.

  • verbose (bool, optional) – set to true to print out convergence diagnostics.

  • tol (float, optional) – tolerance for termination of the function minimization process.

  • index (array_like, optional) – index for the resulting DataFrame, which is accessible via self.data

  • fit_method (string, optional) – optimization method to pass to scipy.optimize.minimize

  • maxiter (int, optional) – maximum number of iterations for the optimizer in scipy.optimize.minimize; overridden if set in kwargs.

  • kwargs – keyword arguments to pass to the scipy.optimize.minimize function as the options dict

Returns:

with additional properties like params_ and methods like predict

Return type:

ParetoNBDFitter

plotting module#

btyd.plotting.plot_calibration_purchases_vs_holdout_purchases(model, calibration_holdout_matrix, kind='frequency_cal', n=7, **kwargs)#

Plot calibration purchases vs holdout.

This currently relies too much on the btyd.utils.calibration_and_holdout_data function.

Parameters:
  • model (BTYD model) – A fitted BTYD model.

  • calibration_holdout_matrix (pandas DataFrame) – DataFrame from calibration_and_holdout_data function.

  • kind (str, optional) – Variable for the x-axis. One of “frequency_cal” (purchases in the calibration period), “recency_cal” (age of the customer at last purchase), “T_cal” (age of the customer at the end of the calibration period), or “time_since_last_purchase” (time since the user made their last purchase).

  • n (int, optional) – Number of ticks on the x axis

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_cumulative_transactions(model, transactions, datetime_col, customer_id_col, t, t_cal, datetime_format=None, freq='D', set_index_date=False, title='Tracking Cumulative Transactions', xlabel='day', ylabel='Cumulative Transactions', ax=None, **kwargs)#

Plot a figure of the predicted and actual cumulative transactions of users.

Parameters:
  • model (BTYD model) – A fitted BTYD model

  • transactions (pandas DataFrame) – DataFrame containing the transactions history of the customer_id

  • datetime_col (str) – The column in transactions that denotes the datetime the purchase was made.

  • customer_id_col (str) – The column in transactions that denotes the customer_id

  • t (float) – The number of time units since the beginning of data for which we want to calculate cumulative transactions

  • t_cal (float) – A marker used to indicate where the vertical line for plotting should be.

  • datetime_format (str, optional) – A string that represents the timestamp format. Useful if Pandas can’t understand the provided format.

  • freq (str, optional) – Default ‘D’ for days, ‘W’ for weeks, ‘M’ for months… etc. Full list here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects

  • set_index_date (bool, optional) – When True, set the date as the Pandas DataFrame index. Default False, which indexes by the number of time units.

  • title (str, optional) – Figure title

  • xlabel (str, optional) – Figure xlabel

  • ylabel (str, optional) – Figure ylabel

  • ax (matplotlib.AxesSubplot, optional) – Using user axes

  • kwargs – Passed into the pandas.DataFrame.plot command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_dropout_rate_heterogeneity(model, suptitle='Heterogeneity in Dropout Probability', xlabel='Dropout Probability p', ylabel='Density', suptitle_fontsize=14, **kwargs)#

Plot the estimated beta distribution of p.

Here p is a customer’s probability of dropping out immediately after a transaction.

Parameters:
  • model (BTYD model) – A fitted BTYD model, for now only for BG/NBD

  • suptitle (str, optional) – Figure suptitle

  • xlabel (str, optional) – Figure xlabel

  • ylabel (str, optional) – Figure ylabel

  • kwargs – Passed into the matplotlib.pyplot.plot command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_expected_repeat_purchases(model, title='Expected Number of Repeat Purchases per Customer', xlabel='Time Since First Purchase', ax=None, label=None, **kwargs)#

Plot expected repeat purchases over the calibration period.

Parameters:
  • model (BTYD model) – A fitted BTYD model.

  • max_frequency (int, optional) – The maximum frequency to plot.

  • title (str, optional) – Figure title

  • xlabel (str, optional) – Figure xlabel

  • ax (matplotlib.AxesSubplot, optional) – Using user axes

  • label (str, optional) – Label for plot.

  • kwargs – Passed into the matplotlib.pyplot.plot command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_frequency_recency_matrix(model, T=1, max_frequency=None, max_recency=None, title=None, xlabel="Customer's Historical Frequency", ylabel="Customer's Recency", **kwargs)#

Plot recency frequency matrix as heatmap.

Plot a figure of expected transactions in T next units of time by a customer’s frequency and recency.

Parameters:
  • model (BTYD model) – A fitted BTYD model.

  • T (float, optional) – Next units of time to make predictions for

  • max_frequency (int, optional) – The maximum frequency to plot. Default is max observed frequency.

  • max_recency (int, optional) – The maximum recency to plot. This also determines the age of the customer. Default to max observed age.

  • title (str, optional) – Figure title

  • xlabel (str, optional) – Figure xlabel

  • ylabel (str, optional) – Figure ylabel

  • kwargs – Passed into the matplotlib.imshow command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_history_alive(model, t, transactions, datetime_col, freq='D', start_date=None, ax=None, **kwargs)#

Draw a graph showing the probability of being alive for a customer in time.

Parameters:
  • model (BTYD model) – A fitted BTYD model.

  • t (int) – the number of time units since birth for which to draw p_alive

  • transactions (pandas DataFrame) – DataFrame containing the transactions history of the customer_id

  • datetime_col (str) – The column in the transactions that denotes the datetime the purchase was made

  • freq (str, optional) – Default ‘D’ for days. Other examples: ‘W’ for weeks.

  • start_date (datetime, optional) – Limit xaxis to start date

  • ax (matplotlib.AxesSubplot, optional) – Using user axes

  • kwargs – Passed into the matplotlib.pyplot.plot command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_incremental_transactions(model, transactions, datetime_col, customer_id_col, t, t_cal, datetime_format=None, freq='D', set_index_date=False, title='Tracking Daily Transactions', xlabel='day', ylabel='Transactions', ax=None, **kwargs)#

Plot a figure of the predicted and actual incremental transactions of users.

Parameters:
  • model (BTYD model) – A fitted BTYD model

  • transactions (pandas DataFrame) – DataFrame containing the transactions history of the customer_id

  • datetime_col (str) – The column in transactions that denotes the datetime the purchase was made.

  • customer_id_col (str) – The column in transactions that denotes the customer_id

  • t (float) – The number of time units since the beginning of data for which we want to calculate cumulative transactions

  • t_cal (float) – A marker used to indicate where the vertical line for plotting should be.

  • datetime_format (str, optional) – A string that represents the timestamp format. Useful if Pandas can’t understand the provided format.

  • freq (str, optional) – Default ‘D’ for days, ‘W’ for weeks, ‘M’ for months… etc. Full list here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects

  • set_index_date (bool, optional) – When True, set the date as the Pandas DataFrame index. Default False, which indexes by the number of time units.

  • title (str, optional) – Figure title

  • xlabel (str, optional) – Figure xlabel

  • ylabel (str, optional) – Figure ylabel

  • ax (matplotlib.AxesSubplot, optional) – Using user axes

  • kwargs – Passed into the pandas.DataFrame.plot command.

Returns:

axes

Return type:

matplotlib.AxesSubplot
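The "actual" series in the incremental and cumulative plots are related by a running sum: incremental counts are transactions per period, and cumulative counts accumulate them. The sketch below shows that relationship on a handful of dates using only the standard library; transactions_per_period is a hypothetical helper and does not reproduce how btyd bins transactions.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def transactions_per_period(dates, start, n_periods, period_days=1):
    """Count transactions in each period since start, plus their running total."""
    counts = Counter((d - start).days // period_days for d in dates)
    incremental = [counts.get(i, 0) for i in range(n_periods)]
    cumulative = list(accumulate(incremental))
    return incremental, cumulative


dates = [date(2014, 1, 1), date(2014, 1, 1), date(2014, 1, 3)]
incremental, cumulative = transactions_per_period(dates, date(2014, 1, 1), n_periods=4)
```

Setting period_days to 7 would give weekly bins, loosely mirroring a freq of ‘W’.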

btyd.plotting.plot_period_transactions(model, max_frequency=7, title='Frequency of Repeat Transactions', xlabel='Number of Calibration Period Transactions', ylabel='Customers', **kwargs)#

Plot a figure with period actual and predicted transactions.

Parameters:
  • model (BTYD model) – A fitted BTYD model.

  • max_frequency (int, optional) – The maximum frequency to plot.

  • title (str, optional) – Figure title

  • xlabel (str, optional) – Figure xlabel

  • ylabel (str, optional) – Figure ylabel

  • kwargs – Passed into the matplotlib.pyplot.plot command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_probability_alive_matrix(model, max_frequency=None, max_recency=None, title='Probability Customer is Alive,\nby Frequency and Recency of a Customer', xlabel="Customer's Historical Frequency", ylabel="Customer's Recency", **kwargs)#

Plot probability alive matrix as heatmap.

Plot a figure of the probability a customer is alive based on their frequency and recency.

Parameters:
  • model (BTYD model) – A fitted BTYD model.

  • max_frequency (int, optional) – The maximum frequency to plot. Default is max observed frequency.

  • max_recency (int, optional) – The maximum recency to plot. This also determines the age of the customer. Default to max observed age.

  • title (str, optional) – Figure title

  • xlabel (str, optional) – Figure xlabel

  • ylabel (str, optional) – Figure ylabel

  • kwargs – Passed into the matplotlib.imshow command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

btyd.plotting.plot_transaction_rate_heterogeneity(model, suptitle='Heterogeneity in Transaction Rate', xlabel='Transaction Rate', ylabel='Density', suptitle_fontsize=14, **kwargs)#

Plot the estimated gamma distribution of lambda (customers’ propensities to purchase).

Parameters:
  • model (BTYD model) – A fitted BTYD model, for now only for BG/NBD

  • suptitle (str, optional) – Figure suptitle

  • xlabel (str, optional) – Figure xlabel

  • ylabel (str, optional) – Figure ylabel

  • kwargs – Passed into the matplotlib.pyplot.plot command.

Returns:

axes

Return type:

matplotlib.AxesSubplot

generate_data module#

btyd.generate_data.beta_geometric_beta_binom_model(N, alpha, beta, gamma, delta, size=1)#

Generate artificial data according to the Beta-Geometric/Beta-Binomial Model.

You may wonder why we can have frequency = n_periods, when frequency excludes the first order. When a customer purchases something, they are born, _and in the next period_ we start asking questions about their alive-ness. So the customer has really made frequency + 1 purchases, and has been observed for n_periods + 1 periods.

Parameters:
  • N (array_like) – Number of transaction opportunities for new customers.

  • alpha (float) – Parameters in the model. See [1]_

  • beta (float) – Parameters in the model. See [1]_

  • gamma (float) – Parameters in the model. See [1]_

  • delta (float) – Parameters in the model. See [1]_

  • size (int, optional) – The number of customers to generate

Returns:

with index as customer_ids and the following columns: ‘frequency’, ‘recency’, ‘n_periods’, ‘lambda’, ‘p’, ‘alive’, ‘customer_id’

Return type:

DataFrame
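To make the discrete-time process concrete, here is a minimal single-customer sketch of a BG/BB-style simulation. It assumes the usual reading of the parameters (purchase probability drawn from Beta(alpha, beta), death probability drawn from Beta(gamma, delta)) and a death check before each transaction opportunity; simulate_bgbb_customer is a hypothetical name, and the function is not the library's generator.

```python
import random


def simulate_bgbb_customer(N, alpha, beta, gamma, delta, rng=random):
    """Simulate N transaction opportunities for one customer.

    Returns (frequency, recency, alive). Hypothetical helper, not btyd code.
    """
    p = rng.betavariate(alpha, beta)       # probability of buying while alive
    theta = rng.betavariate(gamma, delta)  # probability of dying at each opportunity

    frequency, recency, alive = 0, 0, True
    for period in range(1, N + 1):
        if rng.random() < theta:
            alive = False                  # once dead, no further purchases
            break
        if rng.random() < p:
            frequency += 1
            recency = period               # period index of the latest purchase
    return frequency, recency, alive


rng = random.Random(0)
sample = [simulate_bgbb_customer(6, 1.2, 0.75, 0.66, 2.78, rng=rng) for _ in range(5)]
```

Because the first order is not counted, a customer who buys at every opportunity ends with frequency equal to N, which is the case the note above describes.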

References

btyd.generate_data.beta_geometric_nbd_model(T, r, alpha, a, b, size=1)#

Generate artificial data according to the BG/NBD model.

See [1] for model details

Parameters:
  • T (array_like) – The length of time observing new customers.

  • r (float) – Parameters in the model. See [1]_

  • alpha (float) – Parameters in the model. See [1]_

  • a (float) – Parameters in the model. See [1]_

  • b (float) – Parameters in the model. See [1]_

  • size (int, optional) – The number of customers to generate

Returns:

With index as customer_ids and the following columns: ‘frequency’, ‘recency’, ‘T’, ‘lambda’, ‘p’, ‘alive’, ‘customer_id’

Return type:

DataFrame

References

btyd.generate_data.beta_geometric_nbd_model_transactional_data(T, r, alpha, a, b, observation_period_end='2019-1-1', freq='D', size=1)#

Generate artificial transactional data according to the BG/NBD model.

See [1] for model details

Parameters:
  • T (int, float or array_like) – The length of time observing new customers.

  • r (float) – Parameters in the model. See [1]_

  • alpha (float) – Parameters in the model. See [1]_

  • a (float) – Parameters in the model. See [1]_

  • b (float) – Parameters in the model. See [1]_

  • observation_period_end (date_like) – The date observation ends

  • freq (string, optional) – Default ‘D’ for days, ‘W’ for weeks, ‘h’ for hours

  • size (int, optional) – The number of customers to generate

Returns:

The following columns: ‘customer_id’, ‘date’

Return type:

DataFrame

References

btyd.generate_data.modified_beta_geometric_nbd_model(T, r, alpha, a, b, size=1)#

Generate artificial data according to the MBG/NBD model.

See [3]_, [4]_ for model details

Parameters:
  • T (array_like) – The length of time observing new customers.

  • r (float) – Parameters in the model. See [1]_

  • alpha (float) – Parameters in the model. See [1]_

  • a (float) – Parameters in the model. See [1]_

  • b (float) – Parameters in the model. See [1]_

  • size (int, optional) – The number of customers to generate

Returns:

with index as customer_ids and the following columns: ‘frequency’, ‘recency’, ‘T’, ‘lambda’, ‘p’, ‘alive’, ‘customer_id’

Return type:

DataFrame

References

btyd.generate_data.pareto_nbd_model(T, r, alpha, s, beta, size=1)#

Generate artificial data according to the Pareto/NBD model.

See [2]_ for model details.

Parameters:
  • T (array_like) – The length of time observing new customers.

  • r (float) – Parameters in the model. See [1]_

  • alpha (float) – Parameters in the model. See [1]_

  • s (float) – Parameters in the model. See [1]_

  • beta (float) – Parameters in the model. See [1]_

  • size (int, optional) – The number of customers to generate

Returns:

with index as customer_ids and the following columns: ‘frequency’, ‘recency’, ‘T’, ‘lambda’, ‘mu’, ‘alive’, ‘customer_id’

Return type:

DataFrame
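A single-customer sketch of the Pareto/NBD data-generating process may help in reading the output columns. It assumes the standard formulation: a purchase rate lambda drawn from Gamma(shape=r, rate=alpha), a dropout rate mu drawn from Gamma(shape=s, rate=beta), an exponentially distributed lifetime, and Poisson purchases while alive. simulate_pareto_nbd_customer is a hypothetical name and not the library's generator.

```python
import random


def simulate_pareto_nbd_customer(T, r, alpha, s, beta, rng=random):
    """Simulate one customer observed for T time units.

    Returns (frequency, recency, alive). Hypothetical helper, not btyd code.
    """
    lam = rng.gammavariate(r, 1.0 / alpha)  # purchase rate: shape r, scale 1/alpha
    mu = rng.gammavariate(s, 1.0 / beta)    # dropout rate: shape s, scale 1/beta
    lifetime = rng.expovariate(mu)          # unobserved time until the customer dies
    horizon = min(T, lifetime)              # purchases stop at death or end of window

    purchase_times = []
    t = rng.expovariate(lam)
    while t <= horizon:
        purchase_times.append(t)
        t += rng.expovariate(lam)

    frequency = len(purchase_times)
    recency = purchase_times[-1] if purchase_times else 0.0
    alive = lifetime > T
    return frequency, recency, alive


rng = random.Random(0)
sample = [simulate_pareto_nbd_customer(40.0, 1.0, 10.0, 1.0, 12.0, rng=rng) for _ in range(5)]
```

Unlike the beta-geometric variants, dropout here happens in continuous time and is not tied to purchase events.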

References

datasets module#

btyd.datasets.load_cdnow_summary(**kwargs)#

Load cdnow customers summary pandas DataFrame.

btyd.datasets.load_cdnow_summary_data_with_monetary_value(**kwargs)#

Load cdnow customers summary with monetary value as pandas DataFrame.

btyd.datasets.load_donations(**kwargs)#

Load donations dataset as pandas DataFrame.

btyd.datasets.load_transaction_data(**kwargs)#

Return a Pandas dataframe of transactional data.

Looks like:

                  date  id
0  2014-03-08 00:00:00   0
1  2014-05-21 00:00:00   1
2  2014-03-14 00:00:00   2
3  2014-04-09 00:00:00   2
4  2014-05-21 00:00:00   2

The data was artificially created using BTYD data generation routines. Data was generated between 2014-01-01 and 2014-12-31.
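A transaction log in this shape is usually reduced to per-customer frequency, recency, and T before fitting. The sketch below applies the common definitions (frequency as distinct purchase days minus one, recency as days between first and last purchase, T as days from first purchase to the end of observation) to the five rows shown above. It is a standard-library illustration with a hypothetical summarize helper, not the library's own summary function, and the end date of 2014-12-31 is taken from the generation window.

```python
from datetime import date


def summarize(transactions, observation_period_end):
    """Reduce (customer_id, date) pairs to frequency, recency, and T in days."""
    by_customer = {}
    for customer_id, day in transactions:
        by_customer.setdefault(customer_id, set()).add(day)

    summary = {}
    for customer_id, days in by_customer.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency": len(days) - 1,  # repeat purchases only
            "recency": (last - first).days,
            "T": (observation_period_end - first).days,
        }
    return summary


transactions = [
    (0, date(2014, 3, 8)),
    (1, date(2014, 5, 21)),
    (2, date(2014, 3, 14)),
    (2, date(2014, 4, 9)),
    (2, date(2014, 5, 21)),
]
summary = summarize(transactions, date(2014, 12, 31))
```

Customers 0 and 1 have a single purchase each, so their frequency and recency are both zero; only customer 2 has repeat purchases.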