org.knowceans.corpus.parsers.reuters
Class ReutersParser

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by org.knowceans.corpus.parsers.reuters.ReutersParser
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler

public class ReutersParser
extends org.xml.sax.helpers.DefaultHandler

ReutersParser parses the Reuters-21578 dataset into a TextCorpus.

TODO: not completed.

Author:
heinrich

Field Summary
private  java.util.Vector<ReutersDocument> allDocs
           
private  ReutersDocument curDoc
           
private  int nr
           
private  java.lang.String prevWord
           
private  Stemmer stem
           
private  StopWordFilter stop
           
 boolean useBigrams
           
 boolean useStemming
           
 boolean useUnigrams
           
 
Constructor Summary
ReutersParser()
           
ReutersParser(java.lang.String stoplist)
           
 
Method Summary
 void configure(boolean useStemming, boolean useUnigrams, boolean useBigrams)
          Configures the parser.
static void main(java.lang.String[] argv)
           
 java.util.Vector<ReutersDocument> parse(java.lang.String file)
          Opens the file and parses its content.
private  java.util.Vector<ReutersDocument> parseDir(java.lang.String sourcefile)
          Parse directory by adding each XML file's content sequentially.
private  java.util.Vector<ReutersDocument> parseString(java.lang.String r)
          Parses the string.
private  int parseText(java.lang.String s, java.util.Vector<java.lang.String> words)
          Parse the given text and add terms to the model.
private  java.lang.String removePunct(java.lang.String s)
          Remove all punctuation
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
characters, endDocument, endElement, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

curDoc

private ReutersDocument curDoc

stop

private StopWordFilter stop

useStemming

public boolean useStemming

useBigrams

public boolean useBigrams

useUnigrams

public boolean useUnigrams

prevWord

private java.lang.String prevWord

nr

private int nr

stem

private Stemmer stem

allDocs

private java.util.Vector<ReutersDocument> allDocs
Constructor Detail

ReutersParser

public ReutersParser()

ReutersParser

public ReutersParser(java.lang.String stoplist)
Parameters:
stoplist -
Method Detail

main

public static void main(java.lang.String[] argv)

configure

public void configure(boolean useStemming,
                      boolean useUnigrams,
                      boolean useBigrams)
Configures the parser.

Parameters:
useStemming - whether to use stemming
useUnigrams - whether to use unigrams
useBigrams - whether to use bigrams
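
The page does not say how the unigram and bigram flags interact, so the following is only a minimal sketch of one plausible reading: each token is emitted as a unigram when useUnigrams is set, and paired with its predecessor when useBigrams is set. The class NgramSketch, the method terms, and the "_" separator are hypothetical and not part of ReutersParser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of ReutersParser. */
final class NgramSketch {

    /**
     * Emits unigrams and/or bigrams for a token sequence.
     * Joining bigrams with '_' is an assumed convention.
     */
    static List<String> terms(List<String> tokens,
                              boolean useUnigrams,
                              boolean useBigrams) {
        List<String> out = new ArrayList<>();
        String prevWord = null; // plays the role the prevWord field may have
        for (String tok : tokens) {
            if (useUnigrams) {
                out.add(tok);
            }
            if (useBigrams && prevWord != null) {
                out.add(prevWord + "_" + tok);
            }
            prevWord = tok;
        }
        return out;
    }
}
```

With both flags set, the tokens "oil", "prices", "rise" would yield oil, prices, oil_prices, rise, prices_rise under these assumptions.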

parse

public java.util.Vector<ReutersDocument> parse(java.lang.String file)
Opens the file and parses its content.

Parameters:
file -
Returns:

parseDir

private java.util.Vector<ReutersDocument> parseDir(java.lang.String sourcefile)
Parse directory by adding each XML file's content sequentially. Document IDs are taken from the docId tag inside each XML document.

Parameters:
sourcefile -
Returns:
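
As a rough illustration of walking a directory file by file, the sketch below selects the XML files and fixes their order before they would be parsed. Sorting by name and matching on the ".xml" suffix are assumptions; the class ParseDirSketch and its methods are hypothetical and do not reproduce the private parseDir method.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of ReutersParser. */
final class ParseDirSketch {

    /** Keeps names ending in ".xml" (case-insensitive) and sorts them. */
    static List<String> selectXml(String[] names) {
        List<String> out = new ArrayList<>();
        if (names == null) {
            return out; // File.list() returns null if the path is not a directory
        }
        for (String n : names) {
            if (n.toLowerCase(Locale.ROOT).endsWith(".xml")) {
                out.add(n);
            }
        }
        Collections.sort(out); // assumed: a stable, name-based processing order
        return out;
    }

    /** Returns the XML files of dir in the order they would be visited. */
    static List<File> xmlFiles(File dir) {
        List<File> files = new ArrayList<>();
        for (String n : selectXml(dir.list())) {
            files.add(new File(dir, n));
        }
        return files;
    }
}
```

A caller would then parse each returned file in turn and append its documents to one shared result vector.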

parseString

private java.util.Vector<ReutersDocument> parseString(java.lang.String r)
Parses the string.

Parameters:
r -
Returns:
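
Since the class extends org.xml.sax.helpers.DefaultHandler, parsing a string presumably runs a SAX parser over it with the handler receiving callbacks. The sketch below shows that general pattern only; the element name BODY, the class SaxStringSketch, and the choice to collect plain strings rather than ReutersDocument objects are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler for illustration; not part of ReutersParser. */
final class SaxStringSketch extends DefaultHandler {
    private final List<String> bodies = new ArrayList<>();
    private StringBuilder buf; // non-null only while inside a BODY element

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("BODY".equals(qName)) { // "BODY" is an assumed element name
            buf = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (buf != null) {
            buf.append(ch, start, length); // text may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("BODY".equals(qName)) {
            bodies.add(buf.toString());
            buf = null;
        }
    }

    /** Parses an XML string and returns the text of every BODY element. */
    static List<String> parseString(String xml) {
        SaxStringSketch handler = new SaxStringSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return handler.bodies;
    }
}
```

The input must be well-formed XML for this to work; raw SGML distributions of the dataset would need conversion first.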

parseText

private int parseText(java.lang.String s,
                      java.util.Vector<java.lang.String> words)
Parse the given text and add terms to the model. Stop-word filtering and stemming are applied here.

Parameters:
s - the text to parse
words - vector that receives the parsed terms
Returns:
number of terms added to words.
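
To make the filtering step concrete, here is a minimal sketch of a tokenize, filter, stem loop that returns the number of terms added. The stop list, the whitespace tokenizer, and the toy stemmer (which only strips a trailing "s") are stand-ins chosen for illustration; the StopWordFilter and Stemmer classes used by ReutersParser are not documented on this page and may behave differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.Vector;

/** Hypothetical helper for illustration; not part of ReutersParser. */
final class ParseTextSketch {

    // Assumed stop list, for illustration only
    static final Set<String> STOP =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    /** Toy stand-in for a real stemmer: strips one trailing "s". */
    static String stem(String w) {
        return (w.length() > 3 && w.endsWith("s"))
                ? w.substring(0, w.length() - 1)
                : w;
    }

    /** Adds the filtered terms of s to words and returns how many were added. */
    static int parseText(String s, Vector<String> words, boolean useStemming) {
        int added = 0;
        for (String tok : s.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (tok.isEmpty() || STOP.contains(tok)) {
                continue; // skip empty tokens and stop words
            }
            words.add(useStemming ? stem(tok) : tok);
            added++;
        }
        return added;
    }
}
```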

removePunct

private java.lang.String removePunct(java.lang.String s)
Remove all punctuation

Parameters:
s -
Returns:
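
One common way to strip punctuation in Java is a regular-expression replace, sketched below. Whether removePunct deletes punctuation outright or replaces it with whitespace is not stated on this page, so deleting it is an assumption; the class RemovePunctSketch is hypothetical.

```java
/** Hypothetical helper for illustration; not part of ReutersParser. */
final class RemovePunctSketch {

    /** Deletes every ASCII punctuation character from s. */
    static String removePunct(String s) {
        // \p{Punct} is the POSIX class for ASCII punctuation in java.util.regex
        return s.replaceAll("\\p{Punct}", "");
    }
}
```

Note that deleting rather than replacing joins tokens such as "U.S." into "US", which may or may not be desirable before stop-word filtering.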