org.knowceans.corpus
Class TermCorpus

java.lang.Object
  extended by org.knowceans.corpus.TermCorpus
All Implemented Interfaces:
IRandomAccessTermCorpus, IRandomAccessTermCorpusFiltered, ITermCorpus, ITermCorpusFiltered
Direct Known Subclasses:
AmqCorpus

public class TermCorpus
extends java.lang.Object
implements IRandomAccessTermCorpus, IRandomAccessTermCorpusFiltered

TermCorpus collects terms from different documents and creates a corpus from them with a one-to-one term <-> id assignment.

This variation of TermCorpus tracks document and term frequencies, so the corpus needs to be evaluated only once and can then be read several times with different mindf and mintf values. Further, all lists have been changed to ArrayList, which is unsynchronized and faster than Vector.

TODO: Filtering terms by minimum term frequency and document frequency could be done in a TermFilter class, possibly as a generalisation of a stoplist.
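The bookkeeping described above (one-to-one term <-> id assignment plus term and document frequency tracking) can be sketched in a self-contained mini version. This is an illustration of the idea only, not the actual TermCorpus implementation; all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniTermCorpus {
    final Map<String, Integer> termIndex = new HashMap<>(); // term -> id
    final List<Integer> termFreqs = new ArrayList<>();      // id -> corpus frequency
    final List<Integer> docFreqs = new ArrayList<>();       // id -> document frequency

    /** Add one document, given as a list of term strings. */
    void add(List<String> terms) {
        Map<Integer, Integer> doc = new HashMap<>();
        for (String t : terms) {
            Integer id = termIndex.get(t);
            if (id == null) {                 // one-to-one term <-> id assignment
                id = termIndex.size();
                termIndex.put(t, id);
                termFreqs.add(0);
                docFreqs.add(0);
            }
            termFreqs.set(id, termFreqs.get(id) + 1);
            doc.merge(id, 1, Integer::sum);   // in-document frequency
        }
        for (int id : doc.keySet()) {         // each document counts once per term
            docFreqs.set(id, docFreqs.get(id) + 1);
        }
    }
}
```

Because both frequency lists are kept, filtering by mindf/mintf can be applied later without re-reading the documents.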

Author:
heinrich

Field Summary
protected  ICategories cats
           
protected  java.util.HashMap<java.lang.Integer,java.lang.Integer> curDoc
          term frequencies of the document currently being added (termid -> frequency)
protected  java.util.ArrayList<java.util.Vector<java.lang.Integer>> docCategories
          store docCategories
protected  java.util.ArrayList<java.lang.Integer> docFreqs
          each term's document frequency
protected  java.util.ArrayList<java.lang.String> docNames
          store docNames
protected  java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTerms
          each document's term frequencies termid -> frequency(doc)
protected  java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTermsFiltered
          each document's filtered term frequencies termid -> frequency(doc) (used when splitting a corpus into filtered and unfiltered terms, e.g., via a minimum document frequency).
 boolean ignoreFiltered
           
protected  int maxId
          maximum term id.
protected  int minDf
           
protected  int minDl
           
protected  int minTf
           
protected  int ndocs
          number of documents
protected  int nterms
          number of unfiltered terms
protected  int ntermsTotal
          number of terms, filtered and unfiltered
protected  int nwords
          number of words
private  int nwordsFiltered
          number of words that have been filtered out.
protected  int OFFSET
          index offset for terms and documents (only tested with 0)
protected  boolean progress
          monitor progress
protected  java.util.ArrayList<java.lang.Integer> termFreqs
          term frequencies
protected  org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndex
          term indices term->termid
 
Constructor Summary
TermCorpus()
           
TermCorpus(EpgCategories categories, int mindf, int mintf, int mindl)
           
TermCorpus(ICategories cats)
           
TermCorpus(ICategories cats, int mindf, int mintf)
          DPA corpus initialiser.
TermCorpus(java.lang.String fileroot, boolean readLowFreq, ICategories cats)
          create an actor-media corpus from files, which means for the corpus root name, all files are read: *.vocab, *.docs, *.actors, *.corpus.
 
Method Summary
 void add(java.util.Vector<java.lang.String> terms)
          add one term vector
(package private)  java.lang.String docCategoriesToString(int docIndex)
           
 java.lang.String docToString(int docIndex, boolean showFiltered)
          Print the document content in order of descending term frequency
 boolean finaliseDocument(java.lang.String name, java.util.Vector<java.lang.Integer> categories)
          Finalise the current document with a name (useful to identify documents but uniqueness not required) and its categories (leave null if unused).
 java.util.ArrayList<java.util.Vector<java.lang.Integer>> getDocCategories()
          Get the categories of all documents.
 java.util.ArrayList<java.lang.String> getDocNames()
          Get a list of all document names / ids.
 java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTerms()
          Get list of document term maps (index->freq)
 java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
          Get the document terms as a frequency map id->frequency.
 java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTermsFiltered()
          Get list of document term maps (index->freq).
 java.util.Map<java.lang.Integer,java.lang.Integer> getDocTermsFiltered(int doc)
          Get the document terms as a frequency map id->frequency.
private  java.util.Vector<java.lang.Integer> getDocWords(int doc, java.util.Random rand)
          Get the words of document doc as a scrambled sequence.
 int[][] getDocWords(java.util.Random rand)
          Get the documents as vectors of bag of words, i.e., per document, a scrambled array of term indices is generated.
 int getNdocs()
          Number of documents in corpus
 int getNterms()
          Number of terms in corpus
 int getNtermsFiltered()
          Number of filtered terms in corpus
 int getNwords()
          Get the number of words (term observations) in the corpus.
 int getNwordsFiltered()
          Number of words in corpus that are filtered.
 org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> getTermIndex()
          Get a bijective map term / id
private  void greedySet(java.util.ArrayList<java.lang.Integer> list, int index, int value)
          Set the element of the list at the specified index to value, increasing the size of the list if index >= size.
 java.lang.String lookup(int term)
          look up term for id.
 int lookup(java.lang.String term)
          look up id for term
 java.lang.String lookupDoc(int id)
          look up document name for id.
 int lookupDoc(java.lang.String name)
          look up id for document.
 int[] parseQuery(java.lang.String query)
          Lookup multiple terms to create a numeric term vector.
 void readCorpus(java.lang.String file, boolean readFiltered)
          Read the corpus in the format number of terms, id:freq for each term
 void readDocList(java.lang.String file)
          Read the vocabulary from a file with format id = termstring (on each line)
private  void readDocTerms(java.lang.String file, java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> data)
          Read the document term maps into the array of maps.
 void readVocabulary(java.lang.String file, boolean readFiltered)
          reads the vocabulary from a file with line format id = termstring = termfreq docfreq
 int reorderCorpus(boolean filterSplit)
          Reorder the terms by document frequency and split the corpus into a regular and a low-frequency part.
 int reorderCorpus0(boolean filterSplit)
          Reorder the vocabulary so the indices of regular-frequency terms span the interval 1..maxLsa, which can be used to reduce the size of a topic extraction problem.
 void writeCorpus(java.lang.String file, boolean writeFiltered)
          Write the complete corpus to a file in the format number of terms, id:freq for each term
 void writeDocList(java.lang.String file)
          Write the vocabulary in a file with format id = termstring (on each line)
private  void writeDocTerms(java.lang.String file, java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> data, boolean sorted)
           
 void writeVocabulary(java.lang.String file, boolean sort, boolean writeFiltered)
          write the vocabulary in a file with line format id = termstring = termfreq docfreq
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

progress

protected boolean progress
monitor progress


ignoreFiltered

public boolean ignoreFiltered

cats

protected ICategories cats

OFFSET

protected int OFFSET
index offset for terms and documents (only tested with 0)


termIndex

protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndex
term indices term->termid


termFreqs

protected java.util.ArrayList<java.lang.Integer> termFreqs
term frequencies


docFreqs

protected java.util.ArrayList<java.lang.Integer> docFreqs
each term's document frequency


docTerms

protected java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTerms
each document's term frequencies termid -> frequency(doc)


docTermsFiltered

protected java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTermsFiltered
each document's filtered term frequencies termid -> frequency(doc) (used when splitting a corpus into filtered and unfiltered terms, e.g., via a minimum document frequency).


curDoc

protected java.util.HashMap<java.lang.Integer,java.lang.Integer> curDoc
term frequencies of the document currently being added (termid -> frequency)


docNames

protected java.util.ArrayList<java.lang.String> docNames
store docNames


docCategories

protected java.util.ArrayList<java.util.Vector<java.lang.Integer>> docCategories
store docCategories


maxId

protected int maxId
maximum term id.


ndocs

protected int ndocs
number of documents


ntermsTotal

protected int ntermsTotal
number of terms, filtered and unfiltered


nterms

protected int nterms
number of unfiltered terms


nwords

protected int nwords
number of words


nwordsFiltered

private int nwordsFiltered
number of words that have been filtered out.


minDf

protected int minDf

minTf

protected int minTf

minDl

protected int minDl
Constructor Detail

TermCorpus

public TermCorpus(java.lang.String fileroot,
                  boolean readLowFreq,
                  ICategories cats)
Create an actor-media corpus from files: for the given corpus root name, the files *.vocab, *.docs, *.actors, and *.corpus are read. If the readLowFreq flag is set, the *.vocab2 and *.corpus2 files are read as well, i.e., the filtered low-frequency terms are included.

Parameters:
fileroot - the root name of all files to be read into the corpus
readLowFreq - whether to additionally read the filtered low-frequency terms from the *.vocab2 and *.corpus2 files

TermCorpus

public TermCorpus()

TermCorpus

public TermCorpus(ICategories cats)

TermCorpus

public TermCorpus(ICategories cats,
                  int mindf,
                  int mintf)
DPA corpus initialiser.

Parameters:
cats -
mindf - use minimum document frequency when reordering
mintf - use minimum term frequency when reordering

TermCorpus

public TermCorpus(EpgCategories categories,
                  int mindf,
                  int mintf,
                  int mindl)
Method Detail

add

public void add(java.util.Vector<java.lang.String> terms)
add one term vector

Parameters:
terms -

finaliseDocument

public boolean finaliseDocument(java.lang.String name,
                                java.util.Vector<java.lang.Integer> categories)
Finalise the current document with a name (useful to identify documents but uniqueness not required) and its categories (leave null if unused). If the document is too short, it is ignored.

Parameters:
name -
categories -
Returns:
true if document was valid (by length)

reorderCorpus

public int reorderCorpus(boolean filterSplit)
Reorder the terms by document frequency and split the corpus into a regular and a low-frequency part.

This is the new implementation of the reorderCorpus() routine; it uses a much clearer design based on a table list, my approach to an inline database.

Parameters:
filterSplit -
Returns:

reorderCorpus0

public int reorderCorpus0(boolean filterSplit)
Reorder the vocabulary so the indices of regular-frequency terms span the interval 1..maxLsa, which can be used to reduce the size of a topic extraction problem. TODO: test and debug.

Parameters:
filterSplit - true if the corpus should be split into filtered and unfiltered terms, where filtering occurs according to a minimum frequency.
Returns:
the number of unfiltered terms.
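The reordering idea can be sketched as follows: sort term ids by descending document frequency, so that terms with df >= mindf end up in the low id range and filtered terms follow. This is an illustrative sketch only; the names and the exact remapping are assumptions, not the actual reorderCorpus0 implementation.

```java
import java.util.Arrays;

class ReorderSketch {
    /** Returns a mapping old id -> new id, ordered by descending document frequency. */
    static int[] reorder(int[] docFreqs) {
        Integer[] order = new Integer[docFreqs.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // sort term ids so that the highest document frequency comes first
        Arrays.sort(order, (a, b) -> docFreqs[b] - docFreqs[a]);
        int[] newId = new int[docFreqs.length];
        for (int rank = 0; rank < order.length; rank++) newId[order[rank]] = rank;
        return newId;
    }

    /** Number of regular (unfiltered) terms, i.e., the split point after reordering. */
    static int splitPoint(int[] docFreqs, int mindf) {
        int n = 0;
        for (int df : docFreqs) if (df >= mindf) n++;
        return n;
    }
}
```

After such a remapping, all ids below the split point belong to the regular part and all ids at or above it to the low-frequency part, which is what makes a filtered/unfiltered split cheap.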

writeVocabulary

public void writeVocabulary(java.lang.String file,
                            boolean sort,
                            boolean writeFiltered)
                     throws java.io.IOException
write the vocabulary in a file with line format id = termstring = termfreq docfreq

Parameters:
file -
sort - sorts the vocabulary in alphabetical order
writeFiltered - writes a second vocabulary file that has all unique terms with a 2 added to the file name
Throws:
java.io.IOException

readVocabulary

public void readVocabulary(java.lang.String file,
                           boolean readFiltered)
                    throws java.lang.NumberFormatException,
                           java.io.IOException
reads the vocabulary from a file with line format id = termstring = termfreq docfreq

Parameters:
file -
readFiltered - read a second file with filtered terms
Throws:
java.io.IOException
java.lang.NumberFormatException

greedySet

private void greedySet(java.util.ArrayList<java.lang.Integer> list,
                       int index,
                       int value)
Set the element of the list at the specified index to value, increasing the size of the list if index >= size.

Parameters:
list -
index -
value -
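A plausible reading of greedySet is: grow the list with padding until the index exists, then set the value. The sketch below is an assumption about the behaviour (in particular the padding value of 0), not the actual private implementation.

```java
import java.util.ArrayList;

class GreedySetSketch {
    /** Set list[index] = value, growing the list with zeros if index >= size. */
    static void greedySet(ArrayList<Integer> list, int index, int value) {
        while (list.size() <= index) {
            list.add(0);   // assumed padding value
        }
        list.set(index, value);
    }
}
```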

writeDocList

public void writeDocList(java.lang.String file)
                  throws java.io.IOException
Write the vocabulary in a file with format id = termstring (on each line)

Parameters:
file -
Throws:
java.io.IOException

readDocList

public void readDocList(java.lang.String file)
                 throws java.io.IOException
Read the vocabulary from a file with format id = termstring (on each line)

Parameters:
file -
Throws:
java.io.IOException
java.lang.NumberFormatException

writeCorpus

public void writeCorpus(java.lang.String file,
                        boolean writeFiltered)
                 throws java.io.IOException
Write the complete corpus to a file in the format number of terms, id:freq for each term

Parameters:
file - file name of the corpus file
writeFiltered - if set, write the filtered words of each document into a separate file with a 2 added to the file extension.
Throws:
java.io.IOException
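The "number of terms, id:freq for each term" format described above suggests one line per document: the count of distinct terms followed by id:frequency pairs. The exact separators below are an assumption based on that description, not a verified dump of the file format.

```java
import java.util.Map;
import java.util.TreeMap;

class CorpusLineSketch {
    /** Format one document's term map as: "<nterms> id:freq id:freq ...". */
    static String format(Map<Integer, Integer> docTerms) {
        StringBuilder sb = new StringBuilder();
        sb.append(docTerms.size());   // number of distinct terms in the document
        // TreeMap gives ascending term ids for a stable, readable line
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(docTerms).entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```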

writeDocTerms

private void writeDocTerms(java.lang.String file,
                           java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> data,
                           boolean sorted)
                    throws java.io.IOException
Throws:
java.io.IOException

readCorpus

public void readCorpus(java.lang.String file,
                       boolean readFiltered)
                throws java.lang.NumberFormatException,
                       java.io.IOException
Read the corpus in the format number of terms, id:freq for each term

Parameters:
file -
readFiltered - if set, read the unique words of each document (those exclusive to that document) from a separate file with a 2 added to the file extension.
Throws:
java.lang.NumberFormatException
java.io.IOException

readDocTerms

private void readDocTerms(java.lang.String file,
                          java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> data)
                   throws java.lang.NumberFormatException,
                          java.io.IOException
Read the document term maps into the array of maps.

Parameters:
file -
data - vector of integer->integer maps which is initialised if argument is null and appended if not.
Throws:
java.lang.NumberFormatException
java.io.IOException

lookup

public java.lang.String lookup(int term)
Description copied from interface: ITermCorpus
look up term for id.

Specified by:
lookup in interface ITermCorpus
Returns:
term string or null if unknown.

lookup

public int lookup(java.lang.String term)
Description copied from interface: ITermCorpus
look up id for term

Specified by:
lookup in interface ITermCorpus
Returns:
term id or -1 if unknown.

parseQuery

public int[] parseQuery(java.lang.String query)
Lookup multiple terms to create a numeric term vector. The only preprocessing provided here is automatic reduction to unfiltered terms known in the index.

Parameters:
query -
Returns:

lookupDoc

public int lookupDoc(java.lang.String name)
look up id for document.

Specified by:
lookupDoc in interface ITermCorpus
Parameters:
name -
Returns:

lookupDoc

public java.lang.String lookupDoc(int id)
look up document name for id.

Specified by:
lookupDoc in interface ITermCorpus
Parameters:
id -
Returns:
document name or null if unknown.

docToString

public java.lang.String docToString(int docIndex,
                                    boolean showFiltered)
Print the document content in order of descending term frequency

Parameters:
docIndex -
showFiltered - set if unique terms should be shown
Returns:

getDocWords

public int[][] getDocWords(java.util.Random rand)
Get the documents as vectors of bag of words, i.e., per document, a scrambled array of term indices is generated.

Specified by:
getDocWords in interface ITermCorpus
Parameters:
rand - random number generator or null to use standard generator
Returns:

getDocWords

private java.util.Vector<java.lang.Integer> getDocWords(int doc,
                                                        java.util.Random rand)
Get the words of document doc as a scrambled sequence.

Parameters:
doc -
rand - random number generator or null to use standard generator
Returns:

getDocTerms

public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
Description copied from interface: ITermCorpus
Get the document terms as a frequency map id->frequency.

Specified by:
getDocTerms in interface ITermCorpus
Returns:

getDocTermsFiltered

public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTermsFiltered(int doc)
Description copied from interface: ITermCorpusFiltered
Get the document terms as a frequency map id->frequency.

Specified by:
getDocTermsFiltered in interface ITermCorpusFiltered
Returns:

docCategoriesToString

java.lang.String docCategoriesToString(int docIndex)
Parameters:
docIndex -
Returns:

getDocNames

public java.util.ArrayList<java.lang.String> getDocNames()
Description copied from interface: IRandomAccessTermCorpus
Get a list of all document names / ids.

Specified by:
getDocNames in interface IRandomAccessTermCorpus
Returns:

getDocCategories

public java.util.ArrayList<java.util.Vector<java.lang.Integer>> getDocCategories()
Get the categories of all documents.

Returns:

getDocTerms

public java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTerms()
Description copied from interface: IRandomAccessTermCorpus
Get list of document term maps (index->freq)

Specified by:
getDocTerms in interface IRandomAccessTermCorpus
Returns:

getDocTermsFiltered

public java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTermsFiltered()
Description copied from interface: IRandomAccessTermCorpusFiltered
Get list of document term maps (index->freq).

Specified by:
getDocTermsFiltered in interface IRandomAccessTermCorpusFiltered
Returns:

getTermIndex

public org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> getTermIndex()
Description copied from interface: IRandomAccessTermCorpus
Get a bijective map term / id

Specified by:
getTermIndex in interface IRandomAccessTermCorpus
Returns:

getNdocs

public int getNdocs()
Description copied from interface: ITermCorpus
Number of documents in corpus

Specified by:
getNdocs in interface ITermCorpus
Returns:

getNterms

public int getNterms()
Description copied from interface: ITermCorpus
Number of terms in corpus

Specified by:
getNterms in interface ITermCorpus
Returns:

getNtermsFiltered

public int getNtermsFiltered()
Description copied from interface: ITermCorpusFiltered
Number of filtered terms in corpus

Specified by:
getNtermsFiltered in interface ITermCorpusFiltered
Returns:

getNwords

public int getNwords()
Description copied from interface: IRandomAccessTermCorpus
Get the number of words (term observations) in the corpus.

Specified by:
getNwords in interface IRandomAccessTermCorpus
Returns:

getNwordsFiltered

public int getNwordsFiltered()
Description copied from interface: ITermCorpusFiltered
Number of words in corpus that are filtered.

Specified by:
getNwordsFiltered in interface ITermCorpusFiltered
Returns: