NumCorpus

org.knowceans.corpus
Class NumCorpus

java.lang.Object
  org.knowceans.corpus.NumCorpus

All Implemented Interfaces:: ICorpus, ISplitCorpus, ITermCorpus

Direct Known Subclasses:: LabelNumCorpus

public class NumCorpus
extends java.lang.Object
implements ICorpus, ITermCorpus, ISplitCorpus

Represents a corpus of documents, using numerical data only.

Author:: heinrich

Constructor Summary
`NumCorpus()`
`NumCorpus(Document[] docs, int numTerms, int numWords)`
`NumCorpus(java.lang.String dataFilename)`
`NumCorpus(java.lang.String dataFilename, int readlimit)` init the corpus with a reduced set of documents

Method Summary
`Document`	`getDoc(int index)`
`int[][]`	`getDocParBounds()` get array of paragraph start indices of the documents (term-based)
`Document[]`	`getDocs()`
`int[][][]`	`getDocTermsFreqs()` get array of document terms and frequencies
`int[][]`	`getDocWordParBounds()` get array of paragraph start indices of the documents (word-based)
`int[]`	`getDocWords(int m, java.util.Random rand)` Get the words of document m as a scrambled varseq.
`int[][]`	`getDocWords(java.util.Random rand)` Get the documents as vectors of bag of words, i.e., per document, a scrambled array of term indices is generated.
`int`	`getNumDocs()`
`int`	`getNumTerms()`
`int`	`getNumTerms(int doc)`
`int`	`getNumWords()`
`int`	`getNumWords(int doc)`
`int[][]`	`getOrigDocIds()` get the original ids of documents according to the corpus file read in.
`ICorpus`	`getTestCorpus()` return the test corpus split
`ICorpus`	`getTrainCorpus()` return the training corpus split
`static void`	`main(java.lang.String[] args)` test corpus reading and splitting
`void`	`mergeDocPars()` merge document paragraphs into a single document each.
`void`	`read(java.lang.String dataFilename)` read a file in "pseudo-SVMlight" format.
`void`	`reduce(int ndocs, java.util.Random rand)` reduce the size of the corpus to ndocs maximum.
`void`	`setDoc(int index, Document doc)`
`void`	`setDocs(Document[] documents)`
`void`	`split(int order, int split, java.util.Random rand)` splits two child corpora of size 1/order off the original corpus, which itself is left unchanged (except for storing the splits).
`java.lang.String`	`toString()`
`void`	`write(java.lang.String pathbase)` write the corpus to a file.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

NumCorpus

public NumCorpus(java.lang.String dataFilename)

NumCorpus

public NumCorpus(java.lang.String dataFilename,
                 int readlimit)

init the corpus with a reduced set of documents

Parameters:: dataFilename -; readlimit -

NumCorpus

public NumCorpus()

NumCorpus

public NumCorpus(Document[] docs,
                 int numTerms,
                 int numWords)

Method Detail

read

public void read(java.lang.String dataFilename)

read a file in "pseudo-SVMlight" format. The format is extended by a paragraph-aware version that repeats the pattern

nterms (term:freq){nterms}

for each paragraph in the document. This way, each paragraph

Parameters:: dataFilename -
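
The pattern above can be illustrated with a small stand-alone parser. This is only a sketch under assumptions, not the code used by read: the class PseudoSvmLightLine and its parse method are hypothetical, and it assumes one document per line, whitespace-separated tokens, and integer term ids and frequencies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of org.knowceans.corpus. */
final class PseudoSvmLightLine {

    /**
     * Parses one document line of the form "nterms (term:freq){nterms}",
     * repeated once per paragraph. Each paragraph is returned as
     * int[2][nterms]: row 0 holds term ids, row 1 holds frequencies.
     */
    static List<int[][]> parse(String line) {
        List<int[][]> paragraphs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return paragraphs;
        }
        // Assumption: tokens are separated by whitespace.
        String[] tokens = trimmed.split("\\s+");
        int pos = 0;
        while (pos < tokens.length) {
            // Leading count of term:freq pairs in this paragraph.
            int nterms = Integer.parseInt(tokens[pos++]);
            if (nterms < 0 || pos + nterms > tokens.length) {
                throw new IllegalArgumentException(
                        "expected " + nterms + " term:freq pairs");
            }
            int[][] par = new int[2][nterms];
            for (int t = 0; t < nterms; t++) {
                String[] pair = tokens[pos++].split(":");
                if (pair.length != 2) {
                    throw new IllegalArgumentException("malformed term:freq pair");
                }
                par[0][t] = Integer.parseInt(pair[0]);
                par[1][t] = Integer.parseInt(pair[1]);
            }
            paragraphs.add(par);
        }
        return paragraphs;
    }
}
```

For instance, PseudoSvmLightLine.parse("2 0:3 5:1 1 7:2") yields two paragraphs, the first with two terms and the second with one.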

getDocs

public Document[] getDocs()

Returns:

getDocTermsFreqs

public int[][][] getDocTermsFreqs()

get array of document terms and frequencies

Specified by:: getDocTermsFreqs in interface ITermCorpus

Returns:: docs[0 = terms, 1 = frequencies][m][t]
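
To make the index order docs[0 = terms, 1 = frequencies][m][t] concrete, the following sketch sums the frequencies of each document to obtain its word count. TermFreqLayout and wordsPerDoc are hypothetical names used only for illustration; the sketch works on a plain int[][][] and assumes nothing about NumCorpus beyond the documented layout.

```java
/** Hypothetical helper for illustration; not part of org.knowceans.corpus. */
final class TermFreqLayout {

    /**
     * Computes the number of words per document from an array laid out as
     * docs[0 = terms, 1 = frequencies][m][t].
     */
    static int[] wordsPerDoc(int[][][] docs) {
        int[][] freqs = docs[1];          // frequencies, indexed [m][t]
        int[] counts = new int[freqs.length];
        for (int m = 0; m < freqs.length; m++) {
            for (int t = 0; t < freqs[m].length; t++) {
                counts[m] += freqs[m][t]; // docs[0][m][t] is the matching term id
            }
        }
        return counts;
    }
}
```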

getDocParBounds

public int[][] getDocParBounds()

get array of paragraph start indices of the documents (term-based)

Returns:

getDocWordParBounds

public int[][] getDocWordParBounds()

get array of paragraph start indices of the documents (word-based)

Returns:

mergeDocPars

public void mergeDocPars()

merge document paragraphs into a single document each.

getDoc

public Document getDoc(int index)

Parameters:: index -
Returns:

getDocWords

public int[][] getDocWords(java.util.Random rand)

Get the documents as vectors of bag of words, i.e., per document, a scrambled array of term indices is generated.

Specified by:: getDocWords in interface ICorpus

Parameters:: rand - random number generator or null to use standard generator
Returns:

getNumWords

public int getNumWords()

Specified by:: getNumWords in interface ICorpus

getDocWords

public int[] getDocWords(int m,
                         java.util.Random rand)

Get the words of document m as a scrambled varseq. For paragraph-based documents, scrambles the paragraphs separately, preserving their boundaries.

Specified by:: getDocWords in interface ICorpus

Parameters:: m -; rand - random number generator or null to omit shuffling
Returns:
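
One way to picture the scrambled word sequence is to expand each term by its frequency and then shuffle the result. The sketch below is an assumption about the general technique, not the implementation in this class: BagOfWords and expand are hypothetical names, a Fisher-Yates shuffle is used, paragraph boundaries are ignored, and a null rand skips shuffling as described for the rand parameter above.

```java
import java.util.Random;

/** Hypothetical helper for illustration; not part of org.knowceans.corpus. */
final class BagOfWords {

    /**
     * Expands parallel term and frequency arrays into a word array in which
     * each term id appears freq times, then shuffles it unless rand is null.
     */
    static int[] expand(int[] terms, int[] freqs, Random rand) {
        int n = 0;
        for (int f : freqs) {
            n += f;
        }
        int[] words = new int[n];
        int k = 0;
        for (int t = 0; t < terms.length; t++) {
            for (int j = 0; j < freqs[t]; j++) {
                words[k++] = terms[t];
            }
        }
        if (rand != null) {
            // Fisher-Yates shuffle over the whole document.
            for (int i = n - 1; i > 0; i--) {
                int j = rand.nextInt(i + 1);
                int tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
        return words;
    }
}
```

Passing a seeded java.util.Random makes the scrambled order reproducible across runs.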

setDoc

public void setDoc(int index,
                   Document doc)

Parameters:: index -; doc -

getNumDocs

public int getNumDocs()

Specified by:: getNumDocs in interface ICorpus

Returns:

getNumTerms

public int getNumTerms()

Specified by:: getNumTerms in interface ICorpus

Returns:

getNumTerms

public int getNumTerms(int doc)

getNumWords

public int getNumWords(int doc)

setDocs

public void setDocs(Document[] documents)

Parameters:: documents -

toString

public java.lang.String toString()

Overrides:: toString in class java.lang.Object

reduce

public void reduce(int ndocs,
                   java.util.Random rand)

reduce the size of the corpus to ndocs maximum. This should be called directly after loading as it only reduces the documents and count

Parameters:: ndocs -; rand -

split

public void split(int order,
                  int split,
                  java.util.Random rand)

splits two child corpora of size 1/order off the original corpus, which itself is left unchanged (except for storing the splits). The corpora can be retrieved using getTrainCorpus and getTestCorpus after calling this method.

Specified by:: split in interface ISplitCorpus

Parameters:: order - number of partitions; split - 0-based split of corpus returned; rand - random source (null for reusing existing splits)
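
The roles of order and split can be sketched with a generic fold-based partition of document indices. This is a hypothetical illustration, not the algorithm used by NumCorpus: FoldSplit is an invented name, the sketch assumes the documents are permuted once and partition number split becomes the test set while the remaining partitions form the training set, and it requires a non-null rand (it does not model reusing existing splits).

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.Random;

/** Hypothetical helper for illustration; not part of org.knowceans.corpus. */
final class FoldSplit {

    /**
     * Partitions document indices 0..numDocs-1 into order folds and returns
     * {training ids, test ids}, where the test ids are fold number split.
     */
    static int[][] split(int numDocs, int order, int split, Random rand) {
        Objects.requireNonNull(rand, "rand");
        if (numDocs < 0 || order < 1 || split < 0 || split >= order) {
            throw new IllegalArgumentException("invalid numDocs, order or split");
        }
        int[] perm = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            perm[i] = i;
        }
        // Fisher-Yates shuffle of the document indices.
        for (int i = numDocs - 1; i > 0; i--) {
            int j = rand.nextInt(i + 1);
            int tmp = perm[i];
            perm[i] = perm[j];
            perm[j] = tmp;
        }
        // Boundaries of the selected fold within the permutation.
        int from = (int) ((long) numDocs * split / order);
        int to = (int) ((long) numDocs * (split + 1) / order);
        int[] test = Arrays.copyOfRange(perm, from, to);
        int[] train = new int[numDocs - test.length];
        System.arraycopy(perm, 0, train, 0, from);
        System.arraycopy(perm, to, train, from, numDocs - to);
        return new int[][] {train, test};
    }
}
```

With 10 documents, order 5 and split 1, this sketch produces 8 training ids and 2 test ids that together cover every document exactly once.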

getTrainCorpus

public ICorpus getTrainCorpus()

return the training corpus split

Specified by:: getTrainCorpus in interface ISplitCorpus

Returns:: the training corpus according to the last splitting operation

getTestCorpus

public ICorpus getTestCorpus()

return the test corpus split

Specified by:: getTestCorpus in interface ISplitCorpus

Returns:: the test corpus according to the last splitting operation

getOrigDocIds

public int[][] getOrigDocIds()

get the original ids of documents according to the corpus file read in. If never split, null.

Specified by:: getOrigDocIds in interface ISplitCorpus

Returns:: [training documents, test documents]

write

public void write(java.lang.String pathbase)
           throws java.io.IOException

write the corpus to a file. TODO: also write document titles and labels (in subclass)

Parameters:: pathbase -
Throws:: java.io.IOException

main

public static void main(java.lang.String[] args)

test corpus reading and splitting

Parameters:: args -
