org.knowceans.corpus
Class LabelNumCorpus

java.lang.Object
  extended by org.knowceans.corpus.NumCorpus
      extended by org.knowceans.corpus.LabelNumCorpus
All Implemented Interfaces:
ICorpus, ILabelCorpus, ISplitCorpus, ITermCorpus

public class LabelNumCorpus
extends NumCorpus
implements ILabelCorpus

Represents a corpus of documents, using numerical data only.

Author:
heinrich

Field Summary
static java.lang.String[] EXTENSIONS
           
 
Fields inherited from interface org.knowceans.corpus.ILabelCorpus
LAUTHORS, LCATEGORIES, LDOCS, LREFERENCES, LTAGS, LTERMS, LVOLS, LYEARS
 
Constructor Summary
LabelNumCorpus()
           
LabelNumCorpus(NumCorpus corp)
          create label corpus from standard one
LabelNumCorpus(java.lang.String dataFilebase)
           
LabelNumCorpus(java.lang.String dataFilebase, boolean parmode)
           
LabelNumCorpus(java.lang.String dataFilebase, int readlimit, boolean parmode)
           
 
Method Summary
 int[][] getDocLabels(int kind)
          loads and returns the document labels of given kind
 int getLabelsMaxN(int kind)
          return the maximum number of labels in any document
 int getLabelsV(int kind)
          get the number of distinct labels in the label field
 int getLabelsW(int kind)
          get the number of tokens in the label field
static void main(java.lang.String[] args)
          test corpus reading and splitting
 void split(int order, int split, java.util.Random rand)
          splits two child corpora of size 1/nsplit off the original corpus, which itself is left unchanged (except storing the splits).
 void write(java.lang.String pathbase)
          write the corpus to a file.
 
Methods inherited from class org.knowceans.corpus.NumCorpus
getDoc, getDocParBounds, getDocs, getDocTermsFreqs, getDocWordParBounds, getDocWords, getDocWords, getNumDocs, getNumTerms, getNumTerms, getNumWords, getNumWords, getOrigDocIds, getTestCorpus, getTrainCorpus, mergeDocPars, read, reduce, setDoc, setDocs, toString
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.knowceans.corpus.ICorpus
getDocWords, getDocWords, getNumDocs, getNumTerms, getNumWords
 

Field Detail

EXTENSIONS

public static final java.lang.String[] EXTENSIONS
Constructor Detail

LabelNumCorpus

public LabelNumCorpus()

LabelNumCorpus

public LabelNumCorpus(java.lang.String dataFilebase)
Parameters:
dataFilebase - base name of the data files (filename without extension)

LabelNumCorpus

public LabelNumCorpus(java.lang.String dataFilebase,
                      boolean parmode)
Parameters:
dataFilebase - base name of the data files (filename without extension)
parmode - if true, read paragraph corpus

LabelNumCorpus

public LabelNumCorpus(java.lang.String dataFilebase,
                      int readlimit,
                      boolean parmode)
Parameters:
dataFilebase - base name of the data files (filename without extension)
readlimit - number of docs to reduce the corpus to when reading (-1 = unlimited)
parmode - if true, read paragraph corpus

LabelNumCorpus

public LabelNumCorpus(NumCorpus corp)
create label corpus from standard one

Parameters:
corp - the standard corpus to create the label corpus from
Method Detail

getDocLabels

public int[][] getDocLabels(int kind)
loads and returns the document labels of given kind

Specified by:
getDocLabels in interface ILabelCorpus
Parameters:
kind - kind of labels
Returns:
the document labels of the given kind

getLabelsMaxN

public int getLabelsMaxN(int kind)
return the maximum number of labels in any document

Parameters:
kind - kind of labels
Returns:
the maximum number of labels in any document

getLabelsW

public int getLabelsW(int kind)
Description copied from interface: ILabelCorpus
get the number of tokens in the label field

Specified by:
getLabelsW in interface ILabelCorpus
Returns:
the number of tokens in the label field

getLabelsV

public int getLabelsV(int kind)
Description copied from interface: ILabelCorpus
get the number of distinct labels in the label field

Specified by:
getLabelsV in interface ILabelCorpus
Returns:
the number of distinct labels in the label field
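
The three label statistics above (getLabelsMaxN, getLabelsW, getLabelsV) can be illustrated on a plain int[][] array of the shape returned by getDocLabels. The following is a minimal sketch, not the library's implementation: the class LabelStatsSketch and its method names are hypothetical, and it assumes that each row holds the label ids of one document and that "distinct labels" means the number of different ids that actually occur.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of org.knowceans.corpus.
final class LabelStatsSketch {

    // Total number of label tokens over all documents (cf. getLabelsW).
    static int labelsW(int[][] docLabels) {
        int w = 0;
        for (int[] doc : docLabels) {
            w += doc.length;
        }
        return w;
    }

    // Number of different label ids that occur (cf. getLabelsV).
    // Assumption: "distinct" is counted over occurring ids.
    static int labelsV(int[][] docLabels) {
        Set<Integer> seen = new HashSet<>();
        for (int[] doc : docLabels) {
            for (int label : doc) {
                seen.add(label);
            }
        }
        return seen.size();
    }

    // Largest number of labels attached to a single document (cf. getLabelsMaxN).
    static int labelsMaxN(int[][] docLabels) {
        int max = 0;
        for (int[] doc : docLabels) {
            max = Math.max(max, doc.length);
        }
        return max;
    }
}
```

For the array {{0, 2}, {2}, {1, 2, 3}}, this sketch counts 6 label tokens, 4 distinct labels, and at most 3 labels per document.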

split

public void split(int order,
                  int split,
                  java.util.Random rand)
Description copied from class: NumCorpus
splits two child corpora of size 1/nsplit off the original corpus, which itself is left unchanged (except storing the splits). The corpora can be retrieved using getTrainCorpus and getTestCorpus after using this function.

Specified by:
split in interface ISplitCorpus
Overrides:
split in class NumCorpus
Parameters:
order - number of partitions
split - 0-based split of corpus returned
rand - random source (null for reusing existing splits)
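
The roles of the order and split parameters can be sketched on document indices alone. This is one plausible reading of the description above, not the library's code: the class SplitSketch is hypothetical, and it assumes that the documents are shuffled with rand, cut into order contiguous partitions, and that partition number split becomes the test set while the remaining documents form the training set. Reusing stored splits with a null rand is not modelled here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of org.knowceans.corpus.
final class SplitSketch {

    // Returns {trainIndices, testIndices} for a corpus of numDocs documents.
    // order: number of partitions; split: 0-based partition used as test set.
    // Assumption: rand must be non-null in this sketch.
    static int[][] split(int numDocs, int order, int split, Random rand) {
        if (order < 1 || split < 0 || split >= order) {
            throw new IllegalArgumentException("need 0 <= split < order");
        }
        // Random permutation of the document indices.
        List<Integer> perm = new ArrayList<>();
        for (int m = 0; m < numDocs; m++) {
            perm.add(m);
        }
        Collections.shuffle(perm, rand);

        // Boundaries of the partition that becomes the test set.
        int from = (int) ((long) numDocs * split / order);
        int to = (int) ((long) numDocs * (split + 1) / order);

        int[] test = new int[to - from];
        int[] train = new int[numDocs - test.length];
        int ti = 0;
        int ri = 0;
        for (int i = 0; i < numDocs; i++) {
            if (i >= from && i < to) {
                test[ti++] = perm.get(i);
            } else {
                train[ri++] = perm.get(i);
            }
        }
        return new int[][] {train, test};
    }
}
```

With 10 documents, order = 5 and split = 2, the sketch yields 2 test documents and 8 training documents, and the original index list is not modified.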

write

public void write(java.lang.String pathbase)
           throws java.io.IOException
Description copied from class: NumCorpus
write the corpus to a file. TODO: also write document titles and labels (in subclass)

Overrides:
write in class NumCorpus
Throws:
java.io.IOException

main

public static void main(java.lang.String[] args)
test corpus reading and splitting

Parameters:
args - command-line arguments