NipsExtractor

org.knowceans.corpus.parsers.nips
Class NipsExtractor

java.lang.Object
  org.knowceans.corpus.parsers.nips.NipsExtractor

public class NipsExtractor
extends java.lang.Object

NipsExtractor uses the XML and BIB files downloaded by NipsDownload to extract abstracts and content of the papers and convert them to an SVMlight-like corpus file.
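The exact layout of the corpus file is not specified here. As a rough illustration of what an "SVMlight-like" sparse line can look like, the following sketch writes one document as space-separated `termId:count` pairs sorted by term id. The class name `SparseLineSketch`, the method `formatLine`, and the absence of a leading label or length field are all assumptions, not the format that NipsExtractor actually writes.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper, not part of the documented API.
final class SparseLineSketch {

    // Formats one document as "id:count id:count ...", ordered by ascending term id.
    static String formatLine(Map<Integer, Integer> termCounts) {
        StringJoiner line = new StringJoiner(" ");
        // A TreeMap copy guarantees ascending term ids regardless of the input map type.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(termCounts).entrySet()) {
            line.add(e.getKey() + ":" + e.getValue());
        }
        return line.toString();
    }
}
```

With counts `{7=1, 2=3}` this sketch produces the line `2:3 7:1`.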

TODO: this is probably my worst code in years, written under the excuse of "lack of time". Restructure and improve!

Author:: heinrich

Nested Class Summary
`(package private) class`	`NipsExtractor.Entry`

Field Summary
`private boolean`	`abstractsonly`
`private AmqCorpus`	`amq`
`(package private) java.lang.String[]`	`bibkeySourcePatterns`
`(package private) java.lang.String[]`	`dirsAndBib`
`(package private) java.lang.String[]`	`fileDestPatterns`
`(package private) java.lang.String`	`fileroot`
`private boolean`	`includerefs`
`private java.lang.String`	`prevWord`
`private EnStemmer`	`stem`
`private StopWordFilter`	`stop`
`boolean`	`useBigrams`
`boolean`	`useStemming`
`boolean`	`useUnigrams`

Constructor Summary
`NipsExtractor()` Standard extractor; initialises the normalisation filters (stoplist and stemmer).
`NipsExtractor(java.lang.String stoplist, boolean stem2, boolean abstractsonly, boolean includerefs)`

Method Summary
`private java.lang.String`	`clean(java.lang.String s)` Cleans the string of the most common LaTeX escapes for special European characters.
`private AmqCorpus`	`createCorpus(java.util.Vector<NipsDocument> docs, int mindf, int mintf)` take a parsed NipsDocument and add its entries to the corpus.
`private java.util.Vector<java.lang.String>`	`getAuthors(java.lang.String entry)` get the authors from a bibentry
`private java.lang.String`	`getTitle(java.lang.String entry)`
`static void`	`main(java.lang.String[] args)`
`private void`	`normaliseAuthors(java.util.Vector<java.lang.String> authors)` changes all authors to a canonical name, i.e., given names are changed to uppercase initials.
`private NipsExtractor.Entry`	`parseBibEntry(java.lang.String s, int i)` parse a bibentry
`java.util.Map<java.lang.String,NipsExtractor.Entry>`	`parseBibtex()` parse the bibtex files and fill the reading map
`private java.util.Vector<NipsDocument>`	`parseMap(java.util.Map<java.lang.String,NipsExtractor.Entry> map)`
`private void`	`parseTerms(java.util.Vector<NipsDocument> docs, boolean abstractsOnly)` Converts each NipsDocument in the vector, which initially contains only raw sections, into one that holds terms and a section index.
`int`	`parseText(java.lang.String s, java.util.Vector<java.lang.String> words)` Parse the given text and add terms to the model.
`private java.lang.String`	`removePunct(java.lang.String s)` Remove all punctuation
`private java.lang.String`	`replaceAbbreviations(java.lang.String s)`
`void`	`run(java.lang.String fileroot, boolean abstractsonly, int mindf, int mintf)`
`void`	`save(java.lang.String corpusname)`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

stop

private StopWordFilter stop

stem

private EnStemmer stem

fileroot

java.lang.String fileroot

dirsAndBib

java.lang.String[] dirsAndBib

bibkeySourcePatterns

java.lang.String[] bibkeySourcePatterns

fileDestPatterns

java.lang.String[] fileDestPatterns

useStemming

public boolean useStemming

useBigrams

public boolean useBigrams

useUnigrams

public boolean useUnigrams

prevWord

private java.lang.String prevWord

amq

private AmqCorpus amq

includerefs

private boolean includerefs

abstractsonly

private boolean abstractsonly

Constructor Detail

NipsExtractor

public NipsExtractor()

Standard extractor; initialises the normalisation filters (stoplist and stemmer).

NipsExtractor

public NipsExtractor(java.lang.String stoplist,
                     boolean stem2,
                     boolean abstractsonly,
                     boolean includerefs)

Parameters:: stoplist -; stem2 -; abstractsonly -; includerefs -

Method Detail

main

public static void main(java.lang.String[] args)

run

public void run(java.lang.String fileroot,
                boolean abstractsonly,
                int mindf,
                int mintf)

save

public void save(java.lang.String corpusname)

createCorpus

private AmqCorpus createCorpus(java.util.Vector<NipsDocument> docs,
                               int mindf,
                               int mintf)

take a parsed NipsDocument and add its entries to the corpus.

Parameters:: docs -
Returns:
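The parameters `mindf` and `mintf` are not described. A plausible reading is that they are minimum document-frequency and minimum term-frequency thresholds for a term to enter the vocabulary. The sketch below illustrates such filtering under that assumption; `VocabularyFilterSketch` and `selectVocabulary` are hypothetical names, and documents are simplified to plain term lists instead of `NipsDocument` objects.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper illustrating frequency thresholds; not the actual createCorpus logic.
final class VocabularyFilterSketch {

    // Keeps terms that occur at least mintf times overall and in at least mindf documents.
    static Set<String> selectVocabulary(List<List<String>> docs, int mindf, int mintf) {
        Map<String, Integer> tf = new HashMap<>(); // total occurrences per term
        Map<String, Integer> df = new HashMap<>(); // number of documents containing the term
        for (List<String> doc : docs) {
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            // Count each term once per document for the document frequency.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> vocabulary = new TreeSet<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            if (e.getValue() >= mintf && df.get(e.getKey()) >= mindf) {
                vocabulary.add(e.getKey());
            }
        }
        return vocabulary;
    }
}
```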

parseTerms

private void parseTerms(java.util.Vector<NipsDocument> docs,
                        boolean abstractsOnly)

Converts each NipsDocument in the vector, which initially contains only raw sections, into one that holds terms and a section index.

Processing is done in 3 steps:

1. normalize author names
2. merge pages (TODO: find sections, ###LARGE)
3. convert to string of terms

Parameters:: docs - list of documents with raw content; abstractsOnly - restricts the corpus generation to abstracts for faster test runs.

parseText

public int parseText(java.lang.String s,
                     java.util.Vector<java.lang.String> words)

Parse the given text and add terms to the model. Stop-word filtering and stemming are applied here.

Parameters:: s -
Returns:: number of terms added to words.
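How the fields `useUnigrams`, `useBigrams`, `useStemming` and `prevWord` interact is not documented. The following is a minimal sketch of one way such a term filter could work, under several assumptions: tokens are split on whitespace and lowercased, a stop word interrupts a bigram, bigrams are joined with an underscore, and the stemmer is a toy placeholder rather than `EnStemmer`. The class `ParseTextSketch` is hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of stop-word filtering, stemming and n-gram generation.
final class ParseTextSketch {
    private final Set<String> stopWords;
    boolean useUnigrams = true;
    boolean useBigrams = false;
    boolean useStemming = false;
    private String prevWord = null; // last accepted term, used to build bigrams

    ParseTextSketch(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Placeholder stemmer (assumption): strips a trailing "s" from longer words.
    private static String stem(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Adds the terms found in s to words and returns how many were added.
    int parseText(String s, List<String> words) {
        int added = 0;
        for (String token : s.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                prevWord = null; // assumption: a stop word breaks the bigram chain
                continue;
            }
            String term = useStemming ? stem(token) : token;
            if (useUnigrams) {
                words.add(term);
                added++;
            }
            if (useBigrams && prevWord != null) {
                words.add(prevWord + "_" + term);
                added++;
            }
            prevWord = term;
        }
        return added;
    }
}
```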

replaceAbbreviations

private java.lang.String replaceAbbreviations(java.lang.String s)

Parameters:: s -
Returns:

removePunct

private java.lang.String removePunct(java.lang.String s)

Remove all punctuation

Parameters:: s -
Returns:
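Whether punctuation is deleted or replaced by whitespace is not stated. A minimal sketch, assuming ASCII punctuation is replaced by a space and runs of whitespace are then collapsed, could look as follows. The class name `PunctuationSketch` is hypothetical.

```java
// Hypothetical sketch; the actual removePunct implementation may differ.
final class PunctuationSketch {

    // Replaces ASCII punctuation with spaces, then collapses and trims whitespace.
    static String removePunct(String s) {
        return s.replaceAll("\\p{Punct}", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

Replacing with a space rather than the empty string keeps words such as `state-of-the-art` from fusing into a single token.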

normaliseAuthors

private void normaliseAuthors(java.util.Vector<java.lang.String> authors)

changes all authors to a canonical name, i.e., given names are changed to uppercase initials.

Parameters:: authors -
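The input name format and the exact canonical form are not documented. The sketch below assumes names arrive as "Given [Middle ...] Surname" and produces "G. M. Surname", modifying the list in place to mirror the `void` return type. `AuthorNameSketch`, `toInitials` and `normalise` are hypothetical names.

```java
import java.util.List;

// Hypothetical sketch of reducing given names to uppercase initials.
final class AuthorNameSketch {

    // Turns "gregor heinrich" into "G. heinrich"; single-word names are returned trimmed.
    static String toInitials(String name) {
        String[] parts = name.trim().split("\\s+");
        if (parts.length < 2) {
            return name.trim();
        }
        StringBuilder sb = new StringBuilder();
        // All parts except the last are treated as given names (assumption).
        for (int i = 0; i < parts.length - 1; i++) {
            sb.append(Character.toUpperCase(parts[i].charAt(0))).append(". ");
        }
        sb.append(parts[parts.length - 1]);
        return sb.toString();
    }

    // Normalises every author name in place.
    static void normalise(List<String> authors) {
        authors.replaceAll(AuthorNameSketch::toInitials);
    }
}
```

Mapping different spellings of the same person to one key ("John Smith", "J. Smith") is the likely motivation for such a canonical form, although it can also merge distinct authors who share initials and surname.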

parseBibtex

public java.util.Map<java.lang.String,NipsExtractor.Entry> parseBibtex()

parse the bibtex files and fill the reading map

parseBibEntry

private NipsExtractor.Entry parseBibEntry(java.lang.String s,
                                          int i)

parse a bibentry

Parameters:: s - bibtex entry; i - pattern index (different file name convention for each year)
Returns:

getAuthors

private java.util.Vector<java.lang.String> getAuthors(java.lang.String entry)

get the authors from a bibentry

Parameters:: entry -
Returns:
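How the author field is located within an entry is not described. As a sketch, assuming a BibTeX entry with an `author = {...}` or `author = "..."` field whose value contains no nested braces, and authors separated by the BibTeX keyword `and`, extraction could look like this. The class `BibAuthorsSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; does not handle nested braces inside the author field.
final class BibAuthorsSketch {
    private static final Pattern AUTHOR = Pattern.compile(
            "\\bauthor\\s*=\\s*[{\"]([^}\"]*)[}\"]", Pattern.CASE_INSENSITIVE);

    // Returns the author names of the first author field, or an empty list if none is found.
    static List<String> getAuthors(String entry) {
        List<String> authors = new ArrayList<>();
        Matcher m = AUTHOR.matcher(entry);
        if (m.find()) {
            // BibTeX separates multiple authors with the word "and".
            for (String raw : m.group(1).split("\\s+and\\s+")) {
                String name = raw.trim();
                if (!name.isEmpty()) {
                    authors.add(name);
                }
            }
        }
        return authors;
    }
}
```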

getTitle

private java.lang.String getTitle(java.lang.String entry)

clean

private java.lang.String clean(java.lang.String s)

Cleans the string of the most common LaTeX escapes for special European characters.

Parameters:: s -
Returns:
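Which escapes are handled, and whether they are mapped to Unicode or to plain ASCII, is not stated. The sketch below assumes a small table of LaTeX accent escapes mapped to Unicode characters, accepting both the braced (`{\"o}`) and unbraced (`\"o`) forms. The class `LatexCleanSketch` and its table are hypothetical and deliberately incomplete.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch covering only a few escapes; the real table is unknown.
final class LatexCleanSketch {
    private static final Map<String, String> ACCENTS = new LinkedHashMap<>();
    static {
        ACCENTS.put("\\\"a", "\u00e4"); // a-umlaut
        ACCENTS.put("\\\"o", "\u00f6"); // o-umlaut
        ACCENTS.put("\\\"u", "\u00fc"); // u-umlaut
        ACCENTS.put("\\'e", "\u00e9");  // e-acute
    }

    // Replaces known LaTeX accent escapes with the corresponding Unicode character.
    static String clean(String s) {
        String out = s;
        for (Map.Entry<String, String> e : ACCENTS.entrySet()) {
            // Handle the braced form first so no stray braces are left behind.
            out = out.replace("{" + e.getKey() + "}", e.getValue());
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }
}
```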

parseMap

private java.util.Vector<NipsDocument> parseMap(java.util.Map<java.lang.String,NipsExtractor.Entry> map)
