org.knowceans.corpus
Class AmqCorpus

java.lang.Object
  extended by org.knowceans.corpus.TermCorpus
      extended by org.knowceans.corpus.AmqCorpus
All Implemented Interfaces:
IRandomAccessTermCorpus, IRandomAccessTermCorpusFiltered, ITermCorpus, ITermCorpusFiltered

public class AmqCorpus
extends TermCorpus

ActorMediaCorpus implements an AMQ corpus, i.e., a document corpus with added functionality for authors and queriers. Cf. AuthorTermCorpus

Author:
heinrich

Field Summary
protected  java.util.Vector<java.lang.String> allActors
          all authors
protected  java.util.Vector<java.util.Vector<java.lang.Integer>> mediaActors
          each document's authors
protected  java.util.Vector<java.lang.String> mediaComments
           
private  int[] mediaRelationCounts
          number of instances of each relation type
protected  java.util.Vector<java.lang.Integer> mediaRelations
          assigns a relation to each medium.
private  int nactors
          Number of actors.
static int REL_AUTHOR
          an authorship relation
static int REL_QUERY
          a query relation
static int REL_RECOMMEND
          a recommendation relation
 
Fields inherited from class org.knowceans.corpus.TermCorpus
cats, curDoc, docCategories, docFreqs, docNames, docTerms, docTermsFiltered, ignoreFiltered, maxId, minDf, minDl, minTf, ndocs, nterms, ntermsTotal, nwords, OFFSET, progress, termFreqs, termIndex
 
Constructor Summary
AmqCorpus()
           
AmqCorpus(ICategories cats)
           
AmqCorpus(int mindf, int mintf)
          Initialiser with size filters.
AmqCorpus(java.lang.String fileroot, boolean readUnique)
          create an actor-media corpus from files, which means for the corpus root name, all files are read: *.vocab, *.docs, *.actors, *.corpus.
AmqCorpus(java.lang.String fileroot, boolean readLowFreq, ICategories cats)
          create an actor-media corpus from files, which means for the corpus root name, all files are read: *.vocab, *.docs, *.actors, *.corpus.
 
Method Summary
 void finaliseDocument(java.lang.String key, java.util.Vector<java.lang.Integer> categories, java.util.Vector<java.lang.String> authors, int relation, java.lang.String comment)
          finalises the current document with a name (useful to identify documents), its categories (leave null if unused) and authors (leave null if unused).
 int[] getActorDocs(int author)
          Get the documents related to the actor.
 java.util.Vector<java.lang.String> getActors()
           
 int[] getDocActors(int doc)
          Get the actors for the document
 int getMaxRelationIndex()
          the highest relation index
 int[][] getMediaActors()
           
 java.util.Vector<java.util.Vector<java.lang.Integer>> getMediaActorsVector()
           
 int[] getMediaRelationCounts()
           
 int[] getMediaRelations()
           
 java.util.Vector<java.lang.Integer> getMediaRelationsVector()
           
 int getNactors()
           
 boolean hasRelations()
          Check whether the corpus has actors and relation data.
private  java.util.Vector<java.lang.Integer> identifyActors(java.util.Vector<java.lang.String> actors)
          Check whether actors are already in the list and assign the number, otherwise add to actors list.
 java.lang.String lookupActor(int id)
          look up actor for id.
 void readActorList(java.lang.String file)
          read actor information from a file with format name,
 void readDocList(java.lang.String file)
          reads the document list.
(package private)  void readQueryList(java.lang.String file)
          read query information from a file with format name : \n query1 \n query2 etc., i.e., an actor followed by a list of query lines (which must not contain the " :" string).
 void readRelations(java.lang.String file)
          As an alternative to reading relations from the documents list, this version works with a separate file with line format: rel_id : (actor_id)+, where the line number - 1 is the 0-based document index.
 void setActorList(java.util.Vector<java.lang.String> actors)
           
 void writeActorList(java.lang.String file)
          write the author list in a file with format id = lastname firstinitials : id (on each line) in alphabetical order.
 void writeDocList(java.lang.String file)
          write the author list in a file with format id = firstname(s) ; lastname ; group (on each line).
 
Methods inherited from class org.knowceans.corpus.TermCorpus
add, docCategoriesToString, docToString, finaliseDocument, getDocCategories, getDocNames, getDocTerms, getDocTerms, getDocTermsFiltered, getDocTermsFiltered, getDocWords, getNdocs, getNterms, getNtermsFiltered, getNwords, getNwordsFiltered, getTermIndex, lookup, lookup, lookupDoc, lookupDoc, parseQuery, readCorpus, readVocabulary, reorderCorpus, reorderCorpus0, writeCorpus, writeVocabulary
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

REL_AUTHOR

public static final int REL_AUTHOR
an authorship relation

See Also:
Constant Field Values

REL_QUERY

public static final int REL_QUERY
a query relation

See Also:
Constant Field Values

REL_RECOMMEND

public static final int REL_RECOMMEND
a recommendation relation

See Also:
Constant Field Values

mediaActors

protected java.util.Vector<java.util.Vector<java.lang.Integer>> mediaActors
each document's authors


mediaRelations

protected java.util.Vector<java.lang.Integer> mediaRelations
assigns a relation to each medium. TODO: media with several relations (to several actors, consider the case of recommendation).


allActors

protected java.util.Vector<java.lang.String> allActors
all authors


mediaComments

protected java.util.Vector<java.lang.String> mediaComments

mediaRelationCounts

private int[] mediaRelationCounts
number of instances of each relation type


nactors

private int nactors
Number of actors.

Constructor Detail

AmqCorpus

public AmqCorpus()

AmqCorpus

public AmqCorpus(ICategories cats)
Parameters:
cats -

AmqCorpus

public AmqCorpus(java.lang.String fileroot,
                 boolean readLowFreq,
                 ICategories cats)
create an actor-media corpus from files: for the given corpus root name, the files *.vocab, *.docs, *.actors and *.corpus are read. If the readLowFreq flag (named readUnique in the two-argument constructor) is set, the *.vocab2 and *.corpus2 files are read as well, i.e., unique terms are included.

Parameters:
fileroot - the root name of all files to be read into the corpus
readLowFreq -

AmqCorpus

public AmqCorpus(int mindf,
                 int mintf)
Initialiser with size filters.

Parameters:
mindf - use minimum document frequency when reordering
mintf - use minimum term frequency when reordering

AmqCorpus

public AmqCorpus(java.lang.String fileroot,
                 boolean readUnique)
create an actor-media corpus from files: for the given corpus root name, the files *.vocab, *.docs, *.actors and *.corpus are read. If the readUnique flag is set, the *.vocab2 and *.corpus2 files are read as well, i.e., unique terms are included.

Parameters:
fileroot - the root name of all files to be read into the corpus
readUnique -
Method Detail

finaliseDocument

public void finaliseDocument(java.lang.String key,
                             java.util.Vector<java.lang.Integer> categories,
                             java.util.Vector<java.lang.String> authors,
                             int relation,
                             java.lang.String comment)
finalises the current document with a name (useful to identify documents), its categories (leave null if unused) and authors (leave null if unused).


identifyActors

private java.util.Vector<java.lang.Integer> identifyActors(java.util.Vector<java.lang.String> actors)
Check whether actors are already in the list and assign the number, otherwise add to actors list.

Parameters:
actors -
Returns:
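
The lookup-or-append behaviour described above can be sketched as follows. This is a minimal illustration, assuming that an actor's id is simply its position in the actor list; the class ActorIndexSketch and its members are hypothetical and are not taken from the AmqCorpus source.

```java
import java.util.Vector;

// Hypothetical stand-in for the actor bookkeeping; not the library's code.
final class ActorIndexSketch {
    // Assumption: an actor's id is its position in this list.
    private final Vector<String> allActors = new Vector<String>();

    // Returns one id per name, appending names that are not yet known.
    Vector<Integer> identifyActors(Vector<String> actors) {
        Vector<Integer> ids = new Vector<Integer>();
        for (String name : actors) {
            int id = allActors.indexOf(name);
            if (id < 0) {
                // Unknown actor: add it and use the new last position as its id.
                allActors.add(name);
                id = allActors.size() - 1;
            }
            ids.add(id);
        }
        return ids;
    }

    // Number of distinct actors seen so far.
    int size() {
        return allActors.size();
    }
}
```

The linear indexOf search keeps the sketch short; for large actor lists a map from name to id would avoid the repeated scan.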

setActorList

public void setActorList(java.util.Vector<java.lang.String> actors)

writeActorList

public void writeActorList(java.lang.String file)
                    throws java.io.IOException
write the author list in a file with format id = lastname firstinitials : id (on each line) in alphabetical order.

Parameters:
file -
Throws:
java.io.IOException

readActorList

public void readActorList(java.lang.String file)
                   throws java.io.IOException
read actor information from a file with format name,

Parameters:
file -
Throws:
java.io.IOException
java.lang.NumberFormatException

readRelations

public void readRelations(java.lang.String file)
As an alternative to reading relations from the documents list, this version works with a separate file with line format: rel_id : (actor_id)+, where the line number - 1 is the 0-based document index.

This deletes all existing actor and relation information.

Parameters:
file -
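
As an illustration of the line format rel_id : (actor_id)+, the sketch below parses such lines into a relation id and a list of actor ids. It assumes that actor ids are separated by whitespace, which the description does not state; RelationLineSketch is a hypothetical helper, not part of the corpus package.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for lines of the form "rel_id : actor_id actor_id ...".
final class RelationLineSketch {
    final int relation;
    final int[] actors;

    RelationLineSketch(int relation, int[] actors) {
        this.relation = relation;
        this.actors = actors;
    }

    // Parses a single line; at least one actor id is required.
    static RelationLineSketch parseLine(String line) {
        String[] parts = line.split(":", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("missing ':' in: " + line);
        }
        int relation = Integer.parseInt(parts[0].trim());
        String rest = parts[1].trim();
        if (rest.isEmpty()) {
            throw new IllegalArgumentException("no actor ids in: " + line);
        }
        // Assumption: actor ids are whitespace-separated.
        String[] tokens = rest.split("\\s+");
        int[] actors = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            actors[i] = Integer.parseInt(tokens[i]);
        }
        return new RelationLineSketch(relation, actors);
    }

    // Entry i of the result describes document i (line number - 1).
    static List<RelationLineSketch> parseAll(List<String> lines) {
        List<RelationLineSketch> result = new ArrayList<RelationLineSketch>();
        for (String line : lines) {
            result.add(parseLine(line));
        }
        return result;
    }
}
```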

readQueryList

void readQueryList(java.lang.String file)
             throws java.io.IOException
read query information from a file with format name : \n query1 \n query2 etc., i.e., an actor followed by a list of query lines (which must not contain the " :" string).

readQueryList ignores unknown terms and actors. Therefore this method MUST be called as the last loader method (i.e., after vocabulary, actors and corpus have been loaded). Queries are further restricted to non-unique terms (this can be changed in the code).

Parameters:
file -
Throws:
java.io.IOException
java.lang.NumberFormatException
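
The query file layout (an actor line containing " :" followed by that actor's query lines) can be sketched as a simple grouping step. This is only an illustration under assumptions: blank lines are skipped, query lines that appear before any actor line are dropped, and no vocabulary or actor lookup is performed. QueryListSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical grouping of query lines under their actor header lines.
final class QueryListSketch {
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
        List<String> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no information
            }
            int sep = line.indexOf(" :");
            if (sep >= 0) {
                // A line containing " :" starts a new actor block.
                String actor = line.substring(0, sep).trim();
                current = result.get(actor);
                if (current == null) {
                    current = new ArrayList<String>();
                    result.put(actor, current);
                }
            } else if (current != null) {
                // Any other line is a query of the most recent actor.
                current.add(line);
            }
        }
        return result;
    }
}
```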

writeDocList

public void writeDocList(java.lang.String file)
                  throws java.io.IOException
write the author list in a file with format id = firstname(s) ; lastname ; group (on each line).

Overrides:
writeDocList in class TermCorpus
Parameters:
file -
Throws:
java.io.IOException

readDocList

public void readDocList(java.lang.String file)
                 throws java.io.IOException
reads the document list. Format for an author-topic corpus is: docname : categories : authors : relation # comment

Overrides:
readDocList in class TermCorpus
Parameters:
file -
Throws:
java.io.IOException
java.lang.NumberFormatException
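
The line format docname : categories : authors : relation # comment can be illustrated by the sketch below, which only splits a line into its fields. It assumes that document names contain no ':' or '#' and that the relation field is an integer; how categories and authors are delimited inside their fields is not stated here, so both are kept as raw strings. DocLineSketch is a hypothetical helper.

```java
// Hypothetical field splitter for one line of the document list.
final class DocLineSketch {
    final String name;
    final String categories; // raw field, inner delimiter unknown
    final String authors;    // raw field, inner delimiter unknown
    final int relation;
    final String comment;

    DocLineSketch(String name, String categories, String authors, int relation, String comment) {
        this.name = name;
        this.categories = categories;
        this.authors = authors;
        this.relation = relation;
        this.comment = comment;
    }

    static DocLineSketch parse(String line) {
        String comment = "";
        String body = line;
        // Everything after the first '#' is treated as the comment.
        int hash = line.indexOf('#');
        if (hash >= 0) {
            comment = line.substring(hash + 1).trim();
            body = line.substring(0, hash);
        }
        // Limit -1 keeps empty trailing fields so the count check is reliable.
        String[] f = body.split(":", -1);
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 ':'-separated fields in: " + line);
        }
        return new DocLineSketch(f[0].trim(), f[1].trim(), f[2].trim(),
                Integer.parseInt(f[3].trim()), comment);
    }
}
```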

lookupActor

public java.lang.String lookupActor(int id)
look up actor for id.

Parameters:
id -
Returns:
actor string or null if unknown.

getNactors

public int getNactors()
Returns:

getMediaRelationCounts

public int[] getMediaRelationCounts()
Returns:

getMediaActorsVector

public java.util.Vector<java.util.Vector<java.lang.Integer>> getMediaActorsVector()
Returns:

getMediaActors

public int[][] getMediaActors()
Returns:
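
The class exposes the per-document actor lists both as nested vectors (getMediaActorsVector) and as int[][] (getMediaActors). A conversion between the two representations might look like the sketch below; the helper MediaActorsArraySketch is hypothetical and does not claim to mirror the actual implementation.

```java
import java.util.Vector;

// Hypothetical conversion from nested vectors to a ragged int array.
final class MediaActorsArraySketch {
    static int[][] toArray(Vector<Vector<Integer>> mediaActors) {
        int[][] result = new int[mediaActors.size()][];
        for (int doc = 0; doc < result.length; doc++) {
            Vector<Integer> actors = mediaActors.get(doc);
            // Each row holds the actor ids of one document.
            result[doc] = new int[actors.size()];
            for (int i = 0; i < actors.size(); i++) {
                result[doc][i] = actors.get(i);
            }
        }
        return result;
    }
}
```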

getActors

public java.util.Vector<java.lang.String> getActors()
Returns:

getMediaRelationsVector

public java.util.Vector<java.lang.Integer> getMediaRelationsVector()
Returns:

getMediaRelations

public int[] getMediaRelations()
Returns:

getMaxRelationIndex

public int getMaxRelationIndex()
the highest relation index

Returns:
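
One way the highest relation index and the per-relation counts could be derived from the per-medium relation array is sketched below. It assumes relation ids are non-negative integers and that the counts array is indexed by relation id; RelationCountsSketch is a hypothetical helper, and the values of REL_AUTHOR, REL_QUERY and REL_RECOMMEND are not assumed.

```java
// Hypothetical derivation of relation statistics from one relation id per medium.
final class RelationCountsSketch {
    // Highest relation id present, or -1 for an empty array.
    static int maxRelationIndex(int[] mediaRelations) {
        int max = -1;
        for (int r : mediaRelations) {
            if (r > max) {
                max = r;
            }
        }
        return max;
    }

    // counts[r] is the number of media whose relation id is r.
    static int[] relationCounts(int[] mediaRelations) {
        int[] counts = new int[maxRelationIndex(mediaRelations) + 1];
        for (int r : mediaRelations) {
            counts[r]++;
        }
        return counts;
    }
}
```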

hasRelations

public boolean hasRelations()
Check whether the corpus has actors and relation data. This can be used to check if the .docs file had sufficient data and load relations using

Returns:

getActorDocs

public int[] getActorDocs(int author)
Get the documents related to the actor.

Parameters:
author -
Returns:
int[document] // TODO: int[document][relation]
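
Retrieving the documents of an actor amounts to inverting the per-document actor lists. A minimal sketch, assuming the lists are indexed by 0-based document index and contain actor ids; ActorDocsSketch is a hypothetical helper, not the library's code.

```java
import java.util.Vector;

// Hypothetical inversion of per-document actor lists.
final class ActorDocsSketch {
    // Returns the indices of all documents whose actor list contains the given actor.
    static int[] actorDocs(Vector<Vector<Integer>> mediaActors, int actor) {
        Vector<Integer> docs = new Vector<Integer>();
        for (int doc = 0; doc < mediaActors.size(); doc++) {
            if (mediaActors.get(doc).contains(actor)) {
                docs.add(doc);
            }
        }
        int[] result = new int[docs.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = docs.get(i);
        }
        return result;
    }
}
```

The scan visits every document on each call; if many actors are queried, building the inverse index once would be the more economical choice.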

getDocActors

public int[] getDocActors(int doc)
Get the actors for the document.

Parameters:
doc -
Returns: