org.knowceans.dirichlet.lsam
Class LsamGibbsSampler

java.lang.Object
  extended by org.knowceans.dirichlet.lsam.LsamGibbsSampler

public class LsamGibbsSampler
extends java.lang.Object

Gibbs sampler for estimating the best assignments of topics for words and actors in a corpus with co-authoring information and personalised queries. The algorithm is based on Mark Steyvers et al., "Probabilistic Author-Topic Models for Information Discovery" (2004), and is presented in G. H., "Probabilistic associative browsing for virtual expert communities" (2005).

TODO: What happens if one author is not searching for anything?
TODO: How can demand be weighted other than within the bounds of pdf normalisation (introduce a null topic by filling with 0 terms? use p(a|z)?)
TODO: Weight the influence of relations in phi.
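
For illustration, the data layout expected by the constructor can be sketched with a tiny synthetic example (all indices and values below are made up, and the meaning of the relation indices is only an assumption):

    // three documents over a vocabulary of V = 5 terms, A = 3 actors, R = 2 relations
    int[][] documents = {
        { 0, 1, 1, 4 },       // document 0: term indices of its words
        { 2, 3, 3 },          // document 1
        { 0, 4, 2, 2, 1 }     // document 2
    };
    int[][] actors = {
        { 0, 1 },             // actors of document 0 (co-authors / queriers)
        { 2 },                // document 1
        { 0, 2 }              // document 2
    };
    int[] relations = { 0, 0, 1 };   // relation type of each document (e.g. 0 = supply, 1 = demand)
    LsamGibbsSampler sampler = new LsamGibbsSampler(documents, actors, relations, 5, 3, 2);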

Author:
heinrich

Field Summary
(package private)  int A
          total number of actors
(package private)  int[][] actors
          actor data (co-author lists and queriers)
(package private)  double alpha
          Dirichlet parameter (actor--topic associations)
(package private)  double beta
          Dirichlet parameter (topic--term associations)
private static int BURN_IN
          burn-in period
(package private)  int[][][] cat
          cat[r][i][j] number of times actor i is assigned to topic j in relation r.
(package private)  int[][] catsum
          catsum[r][i] total number of word (=topic) assignments to actor i in relation r.
(package private)  int[][] cwt
          cwt[i][j] number of instances of word i (term) assigned to topic j.
(package private)  int[] cwtsum
          cwtsum[j] total number of words assigned to topic j.
private static int dispcol
           
private static int ITERATIONS
          max iterations
(package private)  int K
          number of topics
(package private) static java.text.NumberFormat lnf
           
(package private)  int[][] media
          document data (term lists)
(package private)  int numstats
          size of statistics
(package private)  double[][] phisum
          cumulative statistics of phi
(package private)  double[][][] psisum
          cumulative statistics of psi (one per relation)
(package private)  int R
          total number of relation types
static org.knowceans.util.CokusRandom rand
           
(package private)  int[] relations
          relation that each medium maps to
private static int SAMPLE_LAG
          sample lag (if <=0 only one sample taken at the end)
(package private) static java.lang.String[] shades
           
private static int THIN_INTERVAL
          thinning interval (interval between statistics updates)
(package private)  int V
          vocabulary size
(package private)  int[][] x
          actor assignments for each word.
(package private)  int[][] z
          topic assignments for each word.
 
Constructor Summary
LsamGibbsSampler(int[][] documents, int[][] actors, int[] relations, int V, int A, int R)
          Initialise the Gibbs sampler with data.
 
Method Summary
 void configure(int iterations, int burnIn, int thinInterval, int sampleLag)
          Configure the Gibbs sampler
 double[][] getPhi()
          Retrieve estimated topic--word associations.
 double[][][] getPsi()
          Retrieve estimated actor--topic associations, one matrix for each relation.
private  void gibbs(int K, double alpha, double beta)
          Main method: select an initial state, then iterate the Gibbs updates.
 void initialState(int K)
          Initialisation: must start with an assignment of observations to topics.
private static double kl(double[] p, double[] q)
          KL-divergence between distributions p and q.
static void main(java.lang.String[] args)
          Driver with example data.
private static double[][] matchActors(double[][][] psi, int p, int q)
          Build matrix of actor--actor matches according to the Kullback-Leibler divergence, M_ij = KL(demand_i||supply_j).
private  void sampleFullConditional(int m, int n)
          Sample an actor--topic pair (x_i, z_i) from the full conditional distribution: p(x_i = q, z_i = j | z_-i, x_-i, w, a_d, r_d) ∝ (cwt[w_i][j] + beta) / (cwtsum[j] + V * beta) * (cat[r][q][j] + alpha) / (catsum[r][q] + K * alpha)
static java.lang.String shadeDouble(double d, double max)
          create a string representation whose gray value appears as an indicator of magnitude, cf. Hinton diagrams in statistics.
static void test(java.lang.String[] args)
          test driver with a small synthetic data set
private  void updateParams()
          Add to the statistics the values of psi and phi for the current state.
private static void writeParameters(java.lang.String file, java.lang.String corpusname, int k, double alpha, double beta, int x, int m, int v, int w, long duration, int iterations, int samplelag, int burnin, org.knowceans.util.Arguments a)
          write statistics of the current run to a text file for later review
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

media

int[][] media
document data (term lists)


actors

int[][] actors
actor data (co-author lists and queriers)


relations

int[] relations
relation that each medium maps to


V

int V
vocabulary size


A

int A
total number of actors


R

int R
total number of relation types


K

int K
number of topics


alpha

double alpha
Dirichlet parameter (actor--topic associations)


beta

double beta
Dirichlet parameter (topic--term associations)


z

int[][] z
topic assignments for each word. TODO: could be z[r][m][n] to facilitate multiple document relations


x

int[][] x
actor assignments for each word.


cwt

int[][] cwt
cwt[i][j] number of instances of word i (term) assigned to topic j.


cat

int[][][] cat
cat[r][i][j] number of times actor i is assigned to topic j in relation r.


cwtsum

int[] cwtsum
cwtsum[j] total number of words assigned to topic j.


catsum

int[][] catsum
catsum[r][i] total number of word (=topic) assignments to actor i in relation r.


numstats

int numstats
size of statistics


psisum

double[][][] psisum
cumulative statistics of psi (one per relation)


phisum

double[][] phisum
cumulative statistics of phi


THIN_INTERVAL

private static int THIN_INTERVAL
thinning interval (interval between statistics updates)


BURN_IN

private static int BURN_IN
burn-in period


ITERATIONS

private static int ITERATIONS
max iterations


SAMPLE_LAG

private static int SAMPLE_LAG
sample lag (if <=0 only one sample taken at the end)


dispcol

private static int dispcol

rand

public static org.knowceans.util.CokusRandom rand

shades

static java.lang.String[] shades

lnf

static java.text.NumberFormat lnf

Constructor Detail

LsamGibbsSampler

public LsamGibbsSampler(int[][] documents,
                        int[][] actors,
                        int[] relations,
                        int V,
                        int A,
                        int R)
Initialise the Gibbs sampler with data.

Parameters:
documents - document data (term lists)
actors - actor data (co-author lists and queriers)
relations - relation that each document (medium) maps to
V - vocabulary size
A - total number of actors
R - total number of relation types
Method Detail

initialState

public void initialState(int K)
Initialisation: must start with an assignment of observations to topics. Many alternatives are possible; I chose to perform random assignments with equal probabilities.

Parameters:
K - number of topics
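
A minimal sketch of what such a random initialisation could look like (illustrative only; it uses java.util.Random instead of the class's CokusRandom field, and the count updates are only indicated in comments):

    java.util.Random rng = new java.util.Random();
    int M = media.length;
    z = new int[M][];
    x = new int[M][];
    for (int m = 0; m < M; m++) {
        int N = media[m].length;
        z[m] = new int[N];
        x[m] = new int[N];
        for (int n = 0; n < N; n++) {
            int topic = rng.nextInt(K);                            // uniform topic
            int actor = actors[m][rng.nextInt(actors[m].length)];  // uniform actor from the document's actor list
            z[m][n] = topic;
            x[m][n] = actor;
            // count updates would follow here:
            // cwt[media[m][n]][topic]++; cwtsum[topic]++;
            // cat[relations[m]][actor][topic]++; catsum[relations[m]][actor]++;
        }
    }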

gibbs

private void gibbs(int K,
                   double alpha,
                   double beta)
Main method: select an initial state, then repeat a large number of times: 1. select an element; 2. update it conditional on the other elements. If appropriate, output a summary for each run.

Parameters:
K - number of topics
alpha - symmetric prior parameter on actor--topic associations
beta - symmetric prior parameter on topic--term associations
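
The loop structure described above can be sketched as follows (illustrative only; console output and the dispcol bookkeeping are omitted):

    for (int i = 0; i < ITERATIONS; i++) {
        // 1.+2. resample the (actor, topic) pair of every word from its full conditional
        for (int m = 0; m < z.length; m++) {
            for (int n = 0; n < z[m].length; n++) {
                sampleFullConditional(m, n);
            }
        }
        // after burn-in, accumulate phi/psi statistics every SAMPLE_LAG iterations
        if (i > BURN_IN && SAMPLE_LAG > 0 && i % SAMPLE_LAG == 0) {
            updateParams();
        }
    }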

sampleFullConditional

private void sampleFullConditional(int m,
                                   int n)
Sample an actor--topic pair (x_i, z_i) from the full conditional distribution: p(x_i = q, z_i = j | z_-i, x_-i, w, a_d, r_d) ∝ (cwt[w_i][j] + beta) / (cwtsum[j] + V * beta) * (cat[r][q][j] + alpha) / (catsum[r][q] + K * alpha)

Parameters:
m - document
n - word
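
A sketch of this sampling step, using the count arrays documented above (the decrement/re-increment of the counts and the cumulative sampling of the index are only indicated in comments; the actual method may differ in detail):

    int r = relations[m];
    int t = media[m][n];                       // term index of the current word
    // ... remove the current assignment (x[m][n], z[m][n]) from cwt, cwtsum, cat, catsum ...
    double[] p = new double[actors[m].length * K];
    for (int a = 0; a < actors[m].length; a++) {
        int q = actors[m][a];
        for (int j = 0; j < K; j++) {
            p[a * K + j] = (cwt[t][j] + beta) / (cwtsum[j] + V * beta)
                         * (cat[r][q][j] + alpha) / (catsum[r][q] + K * alpha);
        }
    }
    // ... sample an index u from the unnormalised distribution p (e.g. via its cumulative sum),
    //     decode it into (q, j) = (actors[m][u / K], u % K), re-add the counts,
    //     and store x[m][n] = q, z[m][n] = j ...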

updateParams

private void updateParams()
Add to the statistics the values of psi and phi for the current state.


getPsi

public double[][][] getPsi()
Retrieve estimated actor--topic associations, one matrix for each relation. (In the LS-AMQ model with R=2, psi[0] is theta and psi[1] is xi). If sample lag > 0 then the mean value of all sampled statistics for psi[][][] is taken.

Returns:
psi multinomial mixture of author topics (R x A x K)

getPhi

public double[][] getPhi()
Retrieve estimated topic--word associations. If sample lag > 0 then the mean value of all sampled statistics for phi[][] is taken.

IDEA: here only the supply relation could be considered, e.g. via a per-relation count cwtsum[r][].

Returns:
phi multinomial mixture of topic words (K x V)
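
A sketch of this estimation, assuming the averaging behaviour described above (the non-averaged branch reads phi directly off the current counts):

    double[][] phi = new double[K][V];
    for (int k = 0; k < K; k++) {
        for (int w = 0; w < V; w++) {
            phi[k][w] = (SAMPLE_LAG > 0)
                    ? phisum[k][w] / numstats                       // mean over all sampled states
                    : (cwt[w][k] + beta) / (cwtsum[k] + V * beta);  // single-sample estimate
        }
    }
    return phi;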

configure

public void configure(int iterations,
                      int burnIn,
                      int thinInterval,
                      int sampleLag)
Configure the Gibbs sampler

Parameters:
iterations - number of total iterations
burnIn - number of burn-in iterations
thinInterval - update statistics interval
sampleLag - sample interval (<= 0 for just one sample at the end)
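
An illustrative call on a sampler instance (the variable name sampler and the values are only for illustration):

    // 10000 iterations, 2000 burn-in, thinning interval 100, one sample every 10 iterations
    sampler.configure(10000, 2000, 100, 10);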

main

public static void main(java.lang.String[] args)
Driver with example data. TODO: implement averaging of phi and psi.

Parameters:
args -

writeParameters

private static void writeParameters(java.lang.String file,
                                    java.lang.String corpusname,
                                    int k,
                                    double alpha,
                                    double beta,
                                    int x,
                                    int m,
                                    int v,
                                    int w,
                                    long duration,
                                    int iterations,
                                    int samplelag,
                                    int burnin,
                                    org.knowceans.util.Arguments a)
write statistics of the current run to a text file for later review

Parameters:
file -
corpusname -
k - topics
alpha - hyperparameter
beta - hyperparameter
x - actor count
m - doc count
v - vocabulary/term count
w - word count
duration - training duration
iterations - no. of total iterations
samplelag - sampling lag
burnin - number of burn-in iterations
a - Arguments object

test

public static void test(java.lang.String[] args)
test driver with a small synthetic data set

Parameters:
args -

matchActors

private static double[][] matchActors(double[][][] psi,
                                      int p,
                                      int q)
Build matrix of actor--actor matches according to the Kullback-Leibler divergence, M_ij = KL(demand_i||supply_j).

Parameters:
psi - matrix[][][] of actor--topic associations, by relations in first index
p - relation of left KL argument
q - relation of right KL argument
Returns:
matrix M of actor--actor matches (A x A), with M_ij = KL(psi[p][i] || psi[q][j])

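A sketch of how such a match matrix could be built (psi has dimensions R x A x K, so psi[p].length is the number of actors; kl is the class's private KL-divergence method):

    double[][] M = new double[psi[p].length][psi[q].length];
    for (int i = 0; i < psi[p].length; i++) {
        for (int j = 0; j < psi[q].length; j++) {
            M[i][j] = kl(psi[p][i], psi[q][j]);   // M_ij = KL(demand_i || supply_j)
        }
    }
    return M;
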
kl

private static double kl(double[] p,
                         double[] q)
KL-divergence between distributions p and q. (The measure is asymmetric: KL(p||q) = sum_x p(x) * log(p(x)/q(x)) for discrete p, q.)

Parameters:
p - discrete pdf
q - discrete pdf (q.length = p.length)
Returns:
KL divergence
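
For reference, a minimal sketch of the discrete form, written as a standalone helper; the zero-probability guard (treating 0 * log 0 as 0) is an assumption, not taken from the source:

    static double klDivergence(double[] p, double[] q) {
        double kl = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0) {                        // assumed convention: 0 * log 0 = 0
                kl += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return kl;
    }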

shadeDouble

public static java.lang.String shadeDouble(double d,
                                           double max)
create a string representation whose gray value appears as an indicator of magnitude, cf. Hinton diagrams in statistics.

Parameters:
d - value
max - maximum value
Returns:
string whose gray shading indicates the magnitude of d relative to max