org.knowceans.util
Class Samplers

java.lang.Object
  extended by org.knowceans.util.Samplers

public class Samplers
extends java.lang.Object

Diverse sampling methods, including beta, gamma, multinomial, and Dirichlet distributions as well as Dirichlet processes, using Sethurahman's stick-breaking construction and Chinese restaurant process. The random generator used is a Mersenne Twister (Cokus), which is the only dependency.

FIXME: markov condition in random generator, see random string?

Author:
heinrich (partly adapted from Yee Whye Teh's npbayes Matlab / C code)

Nested Class Summary
static class Samplers.CrpData
          data structure for a Chinese restaurant process CrpData
 
Field Summary
static double lastRand
           
 
Constructor Summary
Samplers()
           
 
Method Summary
static int binarySearch(double[] a, double p)
          perform a binary search and return the first index i at which a[i] >= p.
static double enumClass(double alpha, int numdata)
          enumclass(alpha,numdata) The expected number of tables in a CRP with concentration parameter alpha and numdata items.
static void main(java.lang.String[] args)
           
static double meanLik(double lik)
          meanlik(lik) Computes estimated likelihood from individual samples.
static int randAntoniak(double alpha, int n)
          sample number of components m that a DP(alpha, G0) has after n samples.
static int randBernoulli(double p)
          draw a Bernoulli sample.
static double[] randBeta(double[] aa, double[] bb)
          randbeta(aa, bb) Generates beta samples, one for each element in aa/bb, and scale 1.
static double randBeta(double aa, double bb)
          beta as two-dimensional Dirichlet
static int randBinom(double N, double p)
          draw a binomial sample (by counting Bernoulli samples).
static double randConParam(double alpha, int numgroup, int[] numdata, int[] numtable, double alphaa, double alphab, int numiter)
          randconparam(alpha,numdata,numclass,aa,bb,numiter) Generates a sample from a concentration parameter of a HDP with gamma(aa,bb) prior, and number of classes and data items given in numdata, numclass (has to be row vectors).
static double randConParam(double alpha, int numdata, int numtopic, double alphaa, double alphab, int numiter)
          Sample the Dirichlet process concetration parameter given the topic and data counts and gamma hyperparameters alphaa and alphab.
static Samplers.CrpData randCrp(double[] alpha, int numdata)
          [cc numclass] = randcrp(alpha,numdata) Generates a partition of numdata items with concentration parameter alpha, which can be an array, in which case the Chinese restaurant process has "two new tables to chose for each new customer".
static Samplers.CrpData randCrp(double alpha, int numdata)
           
static double[] randDir(double[] aa)
          randdir(aa) generates one Dirichlet sample vector according to the parameters alpha.
static double[][] randDir(double[][] aa, int direction)
          Generate as many Dirichlet column samples as there are columns (direction = 1; randdir(A, 1)) or row samples as there are rows (direction = 2, randdir(A, 2)) in aa (aa[][]), taking the respective parameters.
static double[] randDir(double[] mean, double precision)
          randdir(aa) generates one Dirichlet sample vector according to the parameters alpha.
static double[][] randDir(double[] aa, int repetitions)
          Generate n Dirichlet samples taking parameters aa.
static double[] randDir(double a, int dimension)
          symmetric Dirichlet sample.
static double[] randDmm(double[] probs, double[][] mean, double[] precision)
          DMM sampling
static double[] randDmm(double[] probs, double[][] mean, double[] precision, int[] component)
          DMM sampling
static double[][] randDmm(int n, double[] probs, double[][] means, double[] precisions)
          DMM sampling
static double[][] randDmm(int n, double[] probs, double[][] means, double[] precisions, int[] components)
          DMM sampling
static double randGamma(double rr)
          self-contained gamma generator.
static double[] randGamma(double[] aa)
          randgamma(aa) Generates gamma samples, one for each element in aa.
static double randGamma(double shape, double scale)
          sample from gamma distribution with defined shape a and scale b:
static double randGmm(double[] probs, double[] mean, double[] sigma)
          GMM sampling
static double randGmm(double[] probs, double[] mean, double[] sigma, int[] component)
          GMM sampling
static double[] randGmm(int n, double[] probs, double[] mean, double[] sigma)
          GMM sampling
static double[] randGmm(int n, double[] probs, double[] mean, double[] sigma, int[] components)
          GMM sampling
static int randMult(double[] pp)
          Creates one multinomial sample given the parameter vector pp.
static int[] randMult(double[] pp, int repetitions)
          Multiply sample a multinomial distribution and return a vector with all samples.
static int randMultDirect(double[] pp)
          Creates one multinomial sample given the parameter vector pp.
static int randMultDirect(double[] pp, double rand)
          Like randMultDirect, but the random number is given as argument.
static int[] randMultFreqs(double[] pp, int repetitions)
          Multiply sample a multinomial distribution and return a vector with category frequencies.
static int randMultSimple(double[] pp)
          old version of the randMult method
static double randNorm(double mu, double sigma)
          uses same approach as java.util.Random()
static int randNumTable(double alpha, int numdata)
          randnumtable(weights,maxtable) For each entry in weights and maxtables, generates the number of tables given concentration parameter (weights) and number of data items (maxtable).
static int[] randPerm(int size)
          Random permutation of size elements (symbols '0'..
 int[] randPerm(int[] set)
          Random permutation of existing set of integers.
static double[] randStick(double[] alpha, int numclass)
          randstick(alpha,numclass) Generates stick-breaking weights with concentration parameter for numclass "sticks".
static java.lang.String randString(int length, byte[] alphabet)
          create a random string of length alphanumeric characters.
static double randUniform(double numvalue)
           
static int randUniform(int numvalue)
           
static double[] stirling(int nn)
          [ss lmss] = stirling(nn) Gives unsigned Stirling numbers of the first kind s(nn,*) in ss.
static void testMult()
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lastRand

public static double lastRand
Constructor Detail

Samplers

public Samplers()
Method Detail

main

public static void main(java.lang.String[] args)

randNorm

public static double randNorm(double mu,
                              double sigma)
uses same approach as java.util.Random()

Parameters:
mu -
sigma -
Returns:

randGmm

public static double randGmm(double[] probs,
                             double[] mean,
                             double[] sigma)
GMM sampling

Parameters:
probs - mixture responsibilities
mean - mean vector
sigma - stddev vector
Returns:

randGmm

public static double randGmm(double[] probs,
                             double[] mean,
                             double[] sigma,
                             int[] component)
GMM sampling

Parameters:
probs - mixture responsibilities
mean - mean vector
sigma - stddev vector
component - [out] componentn[0] is filled with the sampled component index
Returns:

randGmm

public static double[] randGmm(int n,
                               double[] probs,
                               double[] mean,
                               double[] sigma)
GMM sampling

Parameters:
n - number of samples to take (this saves the calculation of the cumulative probabilities for successive trials)
probs - mixture responsibilities
mean - mean vector
sigma - stddev vector
Returns:

randGmm

public static double[] randGmm(int n,
                               double[] probs,
                               double[] mean,
                               double[] sigma,
                               int[] components)
GMM sampling

Parameters:
n - number of samples to take (this saves the calculation of the cumulative probabilities for successive trials)
probs - mixture responsibilities
mean - mean vector
sigma - stddev vector
components - [out] n-vector is filled with the sampled component indices (ignored if null)
Returns:

randDmm

public static double[] randDmm(double[] probs,
                               double[][] mean,
                               double[] precision)
DMM sampling

Parameters:
probs - mixture responsibilities
mean - mean vector of vectors
precision - precision vector
Returns:

randDmm

public static double[] randDmm(double[] probs,
                               double[][] mean,
                               double[] precision,
                               int[] component)
DMM sampling

Parameters:
probs - mixture responsibilities
mean - mean vector of vectors
precision - precision vector
component - [out] sampled component of the mixture (or ignored if null)
Returns:

randDmm

public static double[][] randDmm(int n,
                                 double[] probs,
                                 double[][] means,
                                 double[] precisions)
DMM sampling

Parameters:
probs - mixture responsibilities
means - mean vector of vectors
precisions - precision vector
Returns:

randDmm

public static double[][] randDmm(int n,
                                 double[] probs,
                                 double[][] means,
                                 double[] precisions,
                                 int[] components)
DMM sampling

Parameters:
n - number of trials
probs - mixture responsibilities
means - mean vector of vectors
precisions - precision vector
components - n-vector is filled with the sampled component indices (ignored if null)
Returns:

randBeta

public static double randBeta(double aa,
                              double bb)
beta as two-dimensional Dirichlet

Parameters:
aa -
bb -
Returns:

randBeta

public static double[] randBeta(double[] aa,
                                double[] bb)
randbeta(aa, bb) Generates beta samples, one for each element in aa/bb, and scale 1.

Parameters:
aa -

randGamma

public static double randGamma(double rr)
self-contained gamma generator. Multiply result with scale parameter (or divide by rate parameter). After Teh (npbayes).

Parameters:
rr - shape parameter
Returns:

randGamma

public static double[] randGamma(double[] aa)
randgamma(aa) Generates gamma samples, one for each element in aa.

Parameters:
aa -

randGamma

public static double randGamma(double shape,
                               double scale)
sample from gamma distribution with defined shape a and scale b:

x ~ x^(a-1) * exp(-x/b) / ( gamma(a) * b^a )

E(x) = ab, V(x) = (ab)^2. Note that instead of the scale parameter b, often a rate parameter r = 1/b is used: E(x) = a/r, V(x) = (a/r)^2. For sampling, the following are equivalent: Gamma(a,1)*b <=> Gamma(a,b), with shape parametrisation; Gamma(a,1)/r <=> Gamma(a,r) with rate parametrisation.

Parameters:
shape -
scale -
Returns:

randPerm

public static int[] randPerm(int size)
Random permutation of size elements (symbols '0'.. '[size-1]'). This works a bit like sampling without replacement or a factorial.

Parameters:
size -
Returns:

randPerm

public final int[] randPerm(int[] set)
Random permutation of existing set of integers.

Parameters:
set -
Returns:

randDir

public static double[] randDir(double a,
                               int dimension)
symmetric Dirichlet sample.

Parameters:
aa -
Returns:

randDir

public static double[] randDir(double[] aa)
randdir(aa) generates one Dirichlet sample vector according to the parameters alpha. ORIG: Generates Dirichlet samples, with weights given in aa. The output sums to 1 along normdim, and each such sum corresponds to one Dirichlet sample.

Parameters:
aa -
normdim -
Returns:

randDir

public static double[] randDir(double[] mean,
                               double precision)
randdir(aa) generates one Dirichlet sample vector according to the parameters alpha.

Parameters:
mean - (mean_i = alpha_i / sum_j alpha_j)
precision - (precision = alpha_i / mean_i)
Returns:

randDir

public static double[][] randDir(double[][] aa,
                                 int direction)
Generate as many Dirichlet column samples as there are columns (direction = 1; randdir(A, 1)) or row samples as there are rows (direction = 2, randdir(A, 2)) in aa (aa[][]), taking the respective parameters. After Teh (npbayes).

Parameters:
aa -
direction - -- 2 is more efficient (row-major Java matrix structure)
Returns:

randDir

public static double[][] randDir(double[] aa,
                                 int repetitions)
Generate n Dirichlet samples taking parameters aa.

Parameters:
aa -
Returns:

randMultFreqs

public static int[] randMultFreqs(double[] pp,
                                  int repetitions)
Multiply sample a multinomial distribution and return a vector with category frequencies.

Parameters:
pp -
repetitions -
Returns:
vector of frequencies of the categories

randMult

public static int[] randMult(double[] pp,
                             int repetitions)
Multiply sample a multinomial distribution and return a vector with all samples.

Parameters:
pp -
repetitions -
Returns:
vector of all samples.

randMultSimple

public static int randMultSimple(double[] pp)
old version of the randMult method

Parameters:
pp -
Returns:

testMult

public static void testMult()

randMult

public static int randMult(double[] pp)
Creates one multinomial sample given the parameter vector pp. Each category is named after the index (0-based!) of the respective element of pp; Sometimes called categorical distribution (e.g., in BUGS). This version uses a binary search algorithm and does not require normalisation.


randMultDirect

public static int randMultDirect(double[] pp)
Creates one multinomial sample given the parameter vector pp. Each category is named after the index (0-based!) of the respective element of pp; Sometimes called categorical distribution (e.g., in BUGS). This version uses a binary search algorithm and does not require normalisation. Note that the parameters used directly changed because the multinomial is cumulated to save memory and copying time.


randMultDirect

public static int randMultDirect(double[] pp,
                                 double rand)
Like randMultDirect, but the random number is given as argument.


binarySearch

public static int binarySearch(double[] a,
                               double p)
perform a binary search and return the first index i at which a[i] >= p. Adapted from java.util.Arrays.binarySearch.

Parameters:
a -
p -
Returns:

randBinom

public static int randBinom(double N,
                            double p)
draw a binomial sample (by counting Bernoulli samples).

Parameters:
N -
p -

randBernoulli

public static int randBernoulli(double p)
draw a Bernoulli sample.

Parameters:
p - success probability
Returns:
1 if sucessful, 0 otherwise

randConParam

public static double randConParam(double alpha,
                                  int numgroup,
                                  int[] numdata,
                                  int[] numtable,
                                  double alphaa,
                                  double alphab,
                                  int numiter)
randconparam(alpha,numdata,numclass,aa,bb,numiter) Generates a sample from a concentration parameter of a HDP with gamma(aa,bb) prior, and number of classes and data items given in numdata, numclass (has to be row vectors). Uses auxiliary variable method, for numiter iterations.

Modification of Escobar and West. Works for multiple groups of data. numdata, numclass are row vectors, one element per group. After Teh (npbayes).

Parameters:
alpha - alpha
numgroup - number of components ??
numdata - number of data items per class
numtable - number of per DP
alphaa - hyperparameter (gamma shape)
alphab - hyperparameter (gamma scale)
numiter - number of iterations
Returns:

randConParam

public static double randConParam(double alpha,
                                  int numdata,
                                  int numtopic,
                                  double alphaa,
                                  double alphab,
                                  int numiter)
Sample the Dirichlet process concetration parameter given the topic and data counts and gamma hyperparameters alphaa and alphab. After Escobar and West (1995).

Parameters:
alpha -
numdata -
numtopic -
alphaa -
alphab -
numiter -
Returns:

randCrp

public static Samplers.CrpData randCrp(double alpha,
                                       int numdata)

randCrp

public static Samplers.CrpData randCrp(double[] alpha,
                                       int numdata)
[cc numclass] = randcrp(alpha,numdata) Generates a partition of numdata items with concentration parameter alpha, which can be an array, in which case the Chinese restaurant process has "two new tables to chose for each new customer". cc is sequence of indicator variables denoting which class each data item is in ("on which table each customer sits"), and numclass is the generated number of classes. After Teh (npbayes).

Parameters:
alpha -
numdata -
Returns:

randNumTable

public static int randNumTable(double alpha,
                               int numdata)
randnumtable(weights,maxtable) For each entry in weights and maxtables, generates the number of tables given concentration parameter (weights) and number of data items (maxtable). From npbayes-2.1. enumClass seems to be the expected value of randNumTable. After Teh (npbayes).

Parameters:
weights -
maxtable -
Returns:

randStick

public static double[] randStick(double[] alpha,
                                 int numclass)
randstick(alpha,numclass) Generates stick-breaking weights with concentration parameter for numclass "sticks". XXX: untested. After Teh (npbayes).

Parameters:
alpha -
numclass -
Returns:

enumClass

public static double enumClass(double alpha,
                               int numdata)
enumclass(alpha,numdata) The expected number of tables in a CRP with concentration parameter alpha and numdata items. After Teh (npbayes).

Parameters:
alpha -
numdata -
Returns:

randString

public static java.lang.String randString(int length,
                                          byte[] alphabet)
create a random string of length alphanumeric characters.

Parameters:
length - of output
alphabet - alphabet to be used or null
Returns:

meanLik

public static double meanLik(double lik)
meanlik(lik) Computes estimated likelihood from individual samples. Basically does a harmonic mean of lik in 3rd dimension, followed by normal mean in 2nd. After Teh (npbayes).

Parameters:
lik -
Returns:

stirling

public static double[] stirling(int nn)
[ss lmss] = stirling(nn) Gives unsigned Stirling numbers of the first kind s(nn,*) in ss. ss(i) = s(nn,i-1). ss is normalized so that maximum value is 1, and the log of normalization is given in lmss (static variable). After Teh (npbayes).

Parameters:
nn -
Returns:

randAntoniak

public static int randAntoniak(double alpha,
                               int n)
sample number of components m that a DP(alpha, G0) has after n samples. This was first published by Antoniak (1974). TODO: another check, as direct simulation of CRP tables produces higher results

Parameters:
alpha -
n -
Returns:

randUniform

public static double randUniform(double numvalue)
Parameters:
numclass -
Returns:

randUniform

public static int randUniform(int numvalue)
Parameters:
numclass -
Returns: