java.lang.Object
  org.knowceans.corpus.analysis.VariationOfInformationAnalyser
public class VariationOfInformationAnalyser
VariationOfInformationAnalyser (old: TopicCorrelationAnalyser) analyses the distance between the extracted topics and a priori categories.
Probabilities of hierarchical topics need special consideration. If a document has a category on level n of the N-level hierarchy, it should be possible to reflect this by including levels 1..n-1 as well. If the option hierup is set, the categories are weighted equally with respect to the document rather than with respect to the topic: if a document has, for instance, a level-2 topic A and a level-1 topic B, the set of topics is expanded to the pdf {A, parent_of_A, B} / 3, as opposed to {.5 * A, .5 * parent_of_A, 1 * B} / 2. The level below the current hierarchy level can be included in the same manner.
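The equal weighting per document can be illustrated with a minimal sketch. It is not the class's actual code: the class name `HierUpSketch`, the method `expandUp` and the child-to-parent map `parentOf` are hypothetical names introduced here, and categories are represented as plain strings for brevity.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HierUpSketch {

    /**
     * Expands a document's categories with all of their ancestors and gives
     * every member of the expanded set the same weight.
     * parentOf is a hypothetical child -> parent lookup (an assumption of this
     * sketch); top-level categories have no entry in it.
     */
    static Map<String, Double> expandUp(List<String> cats, Map<String, String> parentOf) {
        Set<String> expanded = new LinkedHashSet<>();
        for (String cat : cats) {
            // walk from the category up to its top-level ancestor
            for (String c = cat; c != null; c = parentOf.get(c)) {
                expanded.add(c);
            }
        }
        Map<String, Double> pdf = new LinkedHashMap<>();
        for (String c : expanded) {
            // equal weight with respect to the document
            pdf.put(c, 1.0 / expanded.size());
        }
        return pdf;
    }

    public static void main(String[] args) {
        // level-2 category A with its parent, plus level-1 category B
        Map<String, String> parentOf = Map.of("A", "parent_of_A");
        System.out.println(expandUp(List.of("A", "B"), parentOf));
    }
}
```

For the example from the text, the three entries A, parent_of_A and B each receive weight 1/3.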
| Nested Class Summary | |
|---|---|
| class | VariationOfInformationAnalyser.DistMetric - DistMetric is a container for the values of the metric. |
| Field Summary | |
|---|---|
| (package private) org.knowceans.map.HashMultiMap<java.lang.Integer,java.lang.Integer> | catDocuments - sparse transpose of docCategories |
| private java.lang.String | comment |
| (package private) int[][] | docCategories - docCategories sparse matrix (will be ) |
| static boolean | doDebug |
| private boolean | hierdown |
| private boolean | hierup |
| private boolean | includeunknown |
| (package private) IptcCategories | iptc - IPTC-Codes |
| (package private) static double | log2 - basis |
| (package private) int | nCats - number of categories |
| (package private) int | nDocs - number of documents |
| (package private) int | nTopics - number of topics |
| (package private) int | nValidDocs - number of valid documents |
| private java.lang.String | outfile |
| private static long | serialVersionUID |
| private double | sumPCatDoc |
| (package private) double[][] | theta - the document--topic associations (theta) |
| Constructor Summary | |
|---|---|
| VariationOfInformationAnalyser(java.lang.String docsfile, java.lang.String thetafile, java.lang.String outfile, java.lang.String comment, boolean hierup, boolean hierdown, boolean includeunknown) | TopicCorrelationAnalyser |
| Method Summary | |
|---|---|
| private void | checkConsistency() - checks whether the object has a consistent state. |
| (package private) double | entropy(double[] p) - entropy of the distribution |
| private void | loadCategoryDists(java.lang.String file) - creates a matrix of a priori probabilities for each document. |
| static void | main(java.lang.String[] args) |
| VariationOfInformationAnalyser.DistMetric | metric() - "Meila-metric" for a priori and a posteriori relationships. |
| private double | mutualInfo(double[] pcat, double[] ptopic) - calculate mutual info for the two clusterings. |
| private double | mutualInfo(double[] pcat, double[] ptopic, double[][] pjoint) - calculate mutual info for the two clusterings if pjoint is known. |
| double | mylog(double arg) |
| double[] | pCat() - the probability of categories given any document, n_c * sum_d p(c\|d) |
| (package private) double | pCatForDoc(int cat, int doc) - returns the probability of a category given the document. |
| private double[][] | pJoint() - calculate joint probability for the two clusterings. |
| double[] | pTopic() - averaged distributions n_d * sum_d p(z\|d) |
| private void | saveCatTopics(double[][] pjoint, double[] pcat, java.lang.String file) - saves the 10 best topics for each category. |
| void | saveTopicCats(double[][] pjoint, double[] ptopic, java.lang.String file) - saves the 10 best categories for each topic: p(r \| s) = p(r, s) / p(s) |
| void | setOutfile(java.lang.String outfile) |
| void | setTheta(double[][] theta) - Set the current value of theta. |
| double | sum(double[] v) |
| double[][] | transpose(double[][] mat) - transpose the matrix |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
private static final long serialVersionUID
public static boolean doDebug
static double log2
int[][] docCategories
org.knowceans.map.HashMultiMap<java.lang.Integer,java.lang.Integer> catDocuments
int nCats
int nDocs
int nValidDocs
int nTopics
double[][] theta
IptcCategories iptc
private java.lang.String outfile
private java.lang.String comment
private double sumPCatDoc
private boolean hierup
private boolean hierdown
private boolean includeunknown
| Constructor Detail |
|---|
public VariationOfInformationAnalyser(java.lang.String docsfile,
java.lang.String thetafile,
java.lang.String outfile,
java.lang.String comment,
boolean hierup,
boolean hierdown,
boolean includeunknown)
docsfile - docs file with category information
thetafile - theta.bin file or null if only metric(double[][]) is used
outfile - for topic-category table or null
comment - comment in output files
hierup - whether hierarchical concepts (IPTC) aggregate their parents
hierdown - whether hierarchical concepts (IPTC) aggregate their children
includeunknown - whether unknown concept descriptors are considered valid concepts
| Method Detail |
|---|
public static void main(java.lang.String[] args)
public void setTheta(double[][] theta)
theta -
public void setOutfile(java.lang.String outfile)
public VariationOfInformationAnalyser.DistMetric metric()
D(X, Y) = H(X) + H(Y) - 2 I(X, Y)
with the entropy H(X) = - sum_x p(x) log p(x)
and the mutual information I(X, Y) = KL( p(x,y) || p(x)p(y) ), i.e. the KL divergence
of the actual joint distribution from the product of the marginals (x and y considered independent).
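The formula above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the code behind metric(): the class name `VariationOfInformationSketch` and its method names are made up for this example, base-2 logarithms are assumed, and terms with zero probability are skipped (0 log 0 = 0).

```java
public class VariationOfInformationSketch {

    /** Natural log of 2; base-2 logarithms are an assumption of this sketch. */
    static final double LOG2 = Math.log(2.0);

    static double log2(double x) {
        return Math.log(x) / LOG2;
    }

    /** H(X) = - sum_x p(x) log p(x), skipping zero-probability entries. */
    static double entropy(double[] p) {
        double h = 0.0;
        for (double px : p) {
            if (px > 0) {
                h -= px * log2(px);
            }
        }
        return h;
    }

    /** I(X, Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ). */
    static double mutualInfo(double[] px, double[] py, double[][] pxy) {
        double mi = 0.0;
        for (int i = 0; i < px.length; i++) {
            for (int j = 0; j < py.length; j++) {
                if (pxy[i][j] > 0) {
                    mi += pxy[i][j] * log2(pxy[i][j] / (px[i] * py[j]));
                }
            }
        }
        return mi;
    }

    /** D(X, Y) = H(X) + H(Y) - 2 I(X, Y). */
    static double variationOfInformation(double[] px, double[] py, double[][] pxy) {
        return entropy(px) + entropy(py) - 2.0 * mutualInfo(px, py, pxy);
    }

    public static void main(String[] args) {
        double[] uniform = {0.5, 0.5};
        double[][] independent = {{0.25, 0.25}, {0.25, 0.25}};
        double[][] identical = {{0.5, 0.0}, {0.0, 0.5}};
        System.out.println(variationOfInformation(uniform, uniform, independent));
        System.out.println(variationOfInformation(uniform, uniform, identical));
    }
}
```

Two independent clusterings give the largest distance for their entropies (here H(X) + H(Y)), while two identical clusterings give a distance of zero.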
private void checkConsistency()
public void saveTopicCats(double[][] pjoint,
double[] ptopic,
java.lang.String file)
pjoint -
ptopic -
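The ranking that saveTopicCats is described as writing, the best categories per topic by p(r | s) = p(r, s) / p(s), might look like the following sketch. The class and method names are hypothetical, the index order `pjoint[category][topic]` is an assumption, and file output is left out.

```java
import java.util.Arrays;

public class TopCategoriesSketch {

    /**
     * p(r | s) = p(r, s) / p(s) for a fixed topic s and all categories r.
     * Assumes pjoint is indexed as pjoint[category][topic].
     */
    static double[] conditional(double[][] pjoint, double[] ptopic, int s) {
        double[] cond = new double[pjoint.length];
        for (int r = 0; r < pjoint.length; r++) {
            // a topic with zero mass gets conditional probability 0 everywhere
            cond[r] = ptopic[s] > 0 ? pjoint[r][s] / ptopic[s] : 0.0;
        }
        return cond;
    }

    /** Indices of the n categories with the highest p(r | s), best first. */
    static int[] topCategories(double[][] pjoint, double[] ptopic, int s, int n) {
        final double[] cond = conditional(pjoint, ptopic, s);
        Integer[] idx = new Integer[cond.length];
        for (int r = 0; r < idx.length; r++) {
            idx[r] = r;
        }
        // sort category indices by descending conditional probability
        Arrays.sort(idx, (a, b) -> Double.compare(cond[b], cond[a]));
        int k = Math.min(n, idx.length);
        int[] top = new int[k];
        for (int i = 0; i < k; i++) {
            top[i] = idx[i];
        }
        return top;
    }

    public static void main(String[] args) {
        double[][] pjoint = {{0.1, 0.2}, {0.3, 0.0}, {0.1, 0.3}};
        double[] ptopic = {0.5, 0.5};
        System.out.println(Arrays.toString(topCategories(pjoint, ptopic, 1, 10)));
    }
}
```

Passing n = 10 corresponds to the "10 best" lists mentioned in the method summary; with fewer than n categories, all of them are returned.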
private void saveCatTopics(double[][] pjoint,
double[] pcat,
java.lang.String file)
pjoint -
pcat -
public double[] pCat()
public double[] pTopic()
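The method summary describes pTopic() as the averaged distributions n_d * sum_d p(z|d). A minimal sketch of such an average follows, assuming that n_d is the normalisation 1 / (number of documents) and that every row of theta is a distribution over topics. The class itself may average only over valid documents; this sketch does not model that, and `PTopicSketch` and `averageRows` are hypothetical names.

```java
public class PTopicSketch {

    /**
     * Averages the rows of theta, where theta[d][z] = p(z | d).
     * Assumes a non-empty, rectangular matrix.
     */
    static double[] averageRows(double[][] theta) {
        int nTopics = theta[0].length;
        double[] p = new double[nTopics];
        for (double[] row : theta) {
            for (int z = 0; z < nTopics; z++) {
                p[z] += row[z];
            }
        }
        for (int z = 0; z < nTopics; z++) {
            // normalise by the number of documents
            p[z] /= theta.length;
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] theta = {{1.0, 0.0}, {0.5, 0.5}};
        System.out.println(java.util.Arrays.toString(averageRows(theta)));
    }
}
```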
double pCatForDoc(int cat,
int doc)
cat -
doc -
double entropy(double[] p)
p -
private double mutualInfo(double[] pcat,
double[] ptopic)
pcat - categories distribution p(c=r)
ptopic - topics distribution p(z=s)
private double mutualInfo(double[] pcat,
double[] ptopic,
double[][] pjoint)
pcat - categories distribution p(c=r)
ptopic - topics distribution p(z=s)
pjoint - joint distribution p(c=r, z=s)
private double[][] pJoint()
private void loadCategoryDists(java.lang.String file)
file -
public double mylog(double arg)
public double sum(double[] v)
public double[][] transpose(double[][] mat)
mat -