java.lang.Object
  org.knowceans.corpus.NumCorpus
public class NumCorpus
Represents a corpus of documents, using numerical data only.
| Constructor Summary | |
|---|---|
| NumCorpus() | |
| NumCorpus(Document[] docs, int numTerms, int numWords) | |
| NumCorpus(java.lang.String dataFilename) | |
| NumCorpus(java.lang.String dataFilename, int readlimit) | init the corpus with a reduced set of documents |
| Method Summary | |
|---|---|
| Document | getDoc(int index) |
| int[][] | getDocParBounds() - get array of paragraph start indices of the documents (term-based) |
| Document[] | getDocs() |
| int[][][] | getDocTermsFreqs() - get array of document terms and frequencies |
| int[][] | getDocWordParBounds() - get array of paragraph start indices of the documents (word-based) |
| int[] | getDocWords(int m, java.util.Random rand) - Get the words of document m as a scrambled varseq. |
| int[][] | getDocWords(java.util.Random rand) - Get the documents as bag-of-words vectors, i.e., a scrambled array of term indices is generated per document. |
| int | getNumDocs() |
| int | getNumTerms() |
| int | getNumTerms(int doc) |
| int | getNumWords() |
| int | getNumWords(int doc) |
| int[][] | getOrigDocIds() - get the original ids of documents according to the corpus file read in. |
| ICorpus | getTestCorpus() - return the test corpus split |
| ICorpus | getTrainCorpus() - return the training corpus split |
| static void | main(java.lang.String[] args) - test corpus reading and splitting |
| void | mergeDocPars() - merge document paragraphs into a single document each. |
| void | read(java.lang.String dataFilename) - read a file in "pseudo-SVMlight" format. |
| void | reduce(int ndocs, java.util.Random rand) - reduce the size of the corpus to ndocs maximum. |
| void | setDoc(int index, Document doc) |
| void | setDocs(Document[] documents) |
| void | split(int order, int split, java.util.Random rand) - splits two child corpora of size 1/nsplit off the original corpus, which itself is left unchanged (except storing the splits). |
| java.lang.String | toString() |
| void | write(java.lang.String pathbase) - write the corpus to a file. |
| Methods inherited from class java.lang.Object |
|---|
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public NumCorpus(java.lang.String dataFilename)

public NumCorpus(java.lang.String dataFilename, int readlimit)
Parameters:
dataFilename -
readlimit -

public NumCorpus()

public NumCorpus(Document[] docs, int numTerms, int numWords)
| Method Detail |
|---|
public void read(java.lang.String dataFilename)
read a file in "pseudo-SVMlight" format:
nterms (term:freq){nterms}
for each paragraph in the document. This way, each paragraph
Parameters:
dataFilename -

public Document[] getDocs()
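
To illustrate the `nterms (term:freq){nterms}` layout that read(java.lang.String) describes, the following is a minimal sketch of a parser for a single line. It is not the NumCorpus implementation: the class `PseudoSvmLightLine`, the method `parseLine`, the whitespace separation of tokens, and the placement of several paragraph groups on one line are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of org.knowceans.corpus. */
public class PseudoSvmLightLine {

    /**
     * Parses one line made of paragraph groups "nterms term:freq ... term:freq".
     * Returns one int[2][nterms] per paragraph: row 0 holds the term indices,
     * row 1 holds the matching frequencies.
     */
    public static List<int[][]> parseLine(String line) {
        List<int[][]> paragraphs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return paragraphs;
        }
        // Assumption: tokens are separated by whitespace.
        String[] tokens = trimmed.split("\\s+");
        int pos = 0;
        while (pos < tokens.length) {
            int nterms = Integer.parseInt(tokens[pos++]);
            if (nterms < 0 || pos + nterms > tokens.length) {
                throw new IllegalArgumentException("bad term count: " + nterms);
            }
            int[][] termsFreqs = new int[2][nterms];
            for (int i = 0; i < nterms; i++) {
                String[] pair = tokens[pos++].split(":");
                if (pair.length != 2) {
                    throw new IllegalArgumentException("expected term:freq");
                }
                termsFreqs[0][i] = Integer.parseInt(pair[0]);
                termsFreqs[1][i] = Integer.parseInt(pair[1]);
            }
            paragraphs.add(termsFreqs);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Two paragraphs: the first with 2 terms, the second with 1 term.
        List<int[][]> pars = parseLine("2 0:3 5:1 1 7:2");
        System.out.println(pars.size() + " paragraphs"); // prints "2 paragraphs"
    }
}
```
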

public int[][][] getDocTermsFreqs()
Specified by:
getDocTermsFreqs in interface ITermCorpus

public int[][] getDocParBounds()

public int[][] getDocWordParBounds()

public void mergeDocPars()

public Document getDoc(int index)
Parameters:
index -

public int[][] getDocWords(java.util.Random rand)
Specified by:
getDocWords in interface ICorpus
Parameters:
rand - random number generator or null to use standard generator

public int getNumWords()
Specified by:
getNumWords in interface ICorpus

public int[] getDocWords(int m, java.util.Random rand)
Specified by:
getDocWords in interface ICorpus
Parameters:
m -
rand - random number generator or null to omit shuffling
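
The "scrambled array of term indices" returned by getDocWords can be pictured as a term/frequency pair expanded into one entry per word occurrence and then shuffled. The sketch below shows that idea under stated assumptions: `BagOfWords` and `expandWords` are hypothetical names, the input arrays stand in for whatever NumCorpus stores internally, and a null `rand` omits shuffling as documented for the single-document variant.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper, not part of org.knowceans.corpus. */
public class BagOfWords {

    /**
     * Expands parallel term/frequency arrays into an array with one entry
     * per word occurrence, then shuffles it if rand is not null.
     */
    public static int[] expandWords(int[] terms, int[] freqs, Random rand) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs differ in length");
        }
        int numWords = 0;
        for (int f : freqs) {
            if (f < 0) {
                throw new IllegalArgumentException("negative frequency");
            }
            numWords += f;
        }
        int[] words = new int[numWords];
        int pos = 0;
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < freqs[i]; j++) {
                words[pos++] = terms[i];
            }
        }
        if (rand != null) {
            // Fisher-Yates shuffle driven by the supplied generator.
            for (int i = words.length - 1; i > 0; i--) {
                int j = rand.nextInt(i + 1);
                int tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        int[] words = expandWords(new int[] {0, 5}, new int[] {3, 1}, null);
        System.out.println(Arrays.toString(words)); // prints "[0, 0, 0, 5]"
    }
}
```

Passing a seeded java.util.Random makes the scrambled order reproducible across runs, which is useful when comparing samplers on the same corpus.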

public void setDoc(int index, Document doc)
Parameters:
index -
doc -

public int getNumDocs()
Specified by:
getNumDocs in interface ICorpus

public int getNumTerms()
Specified by:
getNumTerms in interface ICorpus

public int getNumTerms(int doc)

public int getNumWords(int doc)

public void setDocs(Document[] documents)
Parameters:
documents -

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

public void reduce(int ndocs, java.util.Random rand)
Parameters:
ndocs -
rand -

public void split(int order, int split, java.util.Random rand)
Specified by:
split in interface ISplitCorpus
Parameters:
order - number of partitions
split - 0-based split of corpus returned
rand - random source (null for reusing existing splits)

public ICorpus getTrainCorpus()
Specified by:
getTrainCorpus in interface ISplitCorpus

public ICorpus getTestCorpus()
Specified by:
getTestCorpus in interface ISplitCorpus

public int[][] getOrigDocIds()
Specified by:
getOrigDocIds in interface ISplitCorpus
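
One plausible reading of the order and split parameters is a cross-validation-style partition: the document ids are permuted, cut into order partitions, and partition number split becomes the test set while the rest form the training set. The sketch below follows that reading only; `CorpusSplitSketch` and `splitIds` are hypothetical names, the returned layout {train ids, test ids} is an assumption, and a null `rand` here simply skips shuffling rather than reusing stored splits as the documented method does.

```java
import java.util.Random;

/** Hypothetical helper, not part of org.knowceans.corpus. */
public class CorpusSplitSketch {

    /**
     * Returns {trainIds, testIds} for numDocs documents, using partition
     * number split (0-based) out of order partitions as the test set.
     */
    public static int[][] splitIds(int numDocs, int order, int split, Random rand) {
        if (numDocs < 0 || order < 1 || split < 0 || split >= order) {
            throw new IllegalArgumentException("invalid split arguments");
        }
        int[] perm = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            perm[i] = i;
        }
        if (rand != null) {
            // Fisher-Yates shuffle of the document ids.
            for (int i = numDocs - 1; i > 0; i--) {
                int j = rand.nextInt(i + 1);
                int tmp = perm[i];
                perm[i] = perm[j];
                perm[j] = tmp;
            }
        }
        // Boundaries of the test partition within the permuted ids.
        int from = (int) ((long) split * numDocs / order);
        int to = (int) ((long) (split + 1) * numDocs / order);
        int[] test = new int[to - from];
        int[] train = new int[numDocs - test.length];
        int t = 0;
        int r = 0;
        for (int i = 0; i < numDocs; i++) {
            if (i >= from && i < to) {
                test[t++] = perm[i];
            } else {
                train[r++] = perm[i];
            }
        }
        return new int[][] {train, test};
    }

    public static void main(String[] args) {
        int[][] ids = splitIds(10, 5, 0, null);
        // prints "train=8 test=2"
        System.out.println("train=" + ids[0].length + " test=" + ids[1].length);
    }
}
```

Keeping the original ids in both arrays mirrors the purpose of getOrigDocIds(): results computed on a child corpus can be mapped back to documents in the corpus file.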

public void write(java.lang.String pathbase) throws java.io.IOException
Parameters:
pathbase -
Throws:
java.io.IOException

public static void main(java.lang.String[] args)
Parameters:
args -