
Merge pull request #110 from NIHOPA/pattern_to_spacy
Pattern to spaCy
thoppe authored Mar 19, 2019
2 parents 720dfc1 + bf9bf5a commit 8654343
Showing 84 changed files with 82,478 additions and 1,196 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
@@ -1,7 +1,7 @@
sudo: false
language: python
python:
- - "2.7"
+ - "3.6"

install:
- pip install -r requirements.txt
11 changes: 10 additions & 1 deletion README.md
@@ -5,6 +5,15 @@
[![PyPI](https://img.shields.io/pypi/v/nlpre.svg)](https://pypi.python.org/pypi/nlpre)
[![PyVersion](https://img.shields.io/pypi/pyversions/nlpre.svg)](https://img.shields.io/pypi/pyversions/nlpre.svg)

## Major version update! NLPre 2.0.0

+ The backend NLP engine `pattern.en` has been replaced with `spaCy` v2.1.0. This is a major fix for several problems with `pattern.en`, including poor lemmatization (e.g. cytokine -> cytocow)
+ Support for Python 2 has been dropped
+ Support for custom dictionaries in `replace_from_dictionary`
+ Option for suffix to be used instead of prefix in `replace_from_dictionary`
+ URL replacement can now remove emails
+ `token_replacement` can remove symbols

NLPre is a text (pre)-processing library that helps smooth some of the inconsistencies found in real-world data.
Correcting for issues like random capitalization patterns, strange hyphenations, and abbreviations is an essential part of wrangling textual data but is often left to the user.
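A minimal sketch of the new options above; the keyword names (`suffix=`) are assumptions for illustration, not confirmed signatures, so check the module docstrings for the exact names:

```python
import nlpre

# `suffix=` is an assumed keyword for the new suffix option in
# replace_from_dictionary; the exact name may differ.
parser = nlpre.replace_from_dictionary(suffix="_MeSH")
print(parser("The patient presented with a heart attack."))

# URL replacement can now scrub email addresses as well as URLs.
print(nlpre.url_replacement()("Write to someone@example.com for details."))
```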

@@ -64,7 +73,7 @@ nlpre.logger.setLevel(logging.INFO)
| [**replace_acronyms**](nlpre/replace_acronyms.py) | Replaces acronyms and abbreviations found in a document with their corresponding phrase. If an acronym is explicitly identified with a phrase in a document, then all instances of that acronym in the document will be replaced with the given phrase. If there is no explicit indication what the phrase is within the document, then the most common phrase associated with the acronym in the given counter is used. <br> `The EPA protects trees` <br> `The Environmental_Protection_Agency protects trees`
| [**identify_parenthetical_phrases**](nlpre/identify_parenthetical_phrases.py) | Identifies abbreviations of phrases found in a parenthesis. Returns a counter and can be passed directly into [`replace_acronyms`](nlpre/replace_acronyms). <br> `Environmental Protection Agency (EPA)` <br> `Counter((('Environmental', 'Protection', 'Agency'), 'EPA'):1)` |
| [**separated_parenthesis**](nlpre/separated_parenthesis.py) | Separates parenthetical content into new sentences. This is useful when creating word embeddings, as associations should only be made within the same sentence. Terminal punctuation of a period is added to parenthetical sentences if necessary. <br> `Hello (it is a beautiful day) world.` <br>`Hello world. it is a beautiful day .` |
- | [**pos_tokenizer**](nlpre/pos_tokenizer.py) | Removes all words that are of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the [`pattern.en.parse`](http://www.clips.ua.ac.be/pages/pattern-en#parser) module. <br> `The boy threw the ball into the yard` <br> `boy ball yard` |
+ | [**pos_tokenizer**](nlpre/pos_tokenizer.py) | Removes all words that are of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the [`spaCy`](https://spacy.io/) module. <br> `The boy threw the ball into the yard` <br> `boy ball yard` |
| [**unidecoder**](nlpre/unidecoder.py) | Converts Unicode phrases into ASCII equivalent. <br> `α-Helix β-sheet` <br> `a-Helix b-sheet` |
| [**dedash**](nlpre/dedash.py) | Hyphenations are sometimes erroneously inserted when text is passed through a word-processor. This module attempts to correct the hyphenation pattern by joining split words when the joined form appears in an English word list. <br> `How is the treat- ment going` <br> `How is the treatment going` |
| [**decaps_text**](nlpre/decaps_text.py) | We presume that case is important, but only when it differs from title case. This class normalizes capitalization patterns. <br> `James and Sally had a fMRI` <br> `james and sally had a fMRI` |
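The modules above are plain callables that compose into a pipeline; a short illustrative chain (the sample text is ours):

```python
import nlpre

text = "The Environmental Protection Agency (EPA) regulates pesticides."

# Harvest (phrase, acronym) pairs, then expand the acronyms in place.
counter = nlpre.identify_parenthetical_phrases()(text)
pipeline = [
    nlpre.unidecoder(),
    nlpre.dedash(),
    nlpre.separated_parenthesis(),
    nlpre.replace_acronyms(counter),
]
for step in pipeline:
    text = step(text)
print(text)
```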
19 changes: 18 additions & 1 deletion development/README.md
@@ -1,4 +1,21 @@
- ## Current speed test
+ ## Current speed test (spaCy 2.1.0)

```
time frac
function
unidecoder 0.000002 0.000005
token_replacement 0.000038 0.000121
dedash 0.000354 0.001111
replace_from_dictionary 0.000584 0.001831
identify_parenthetical_phrases 0.005599 0.017542
titlecaps 0.058357 0.182846
pos_tokenizer 0.058928 0.184635
decaps_text 0.059976 0.187919
separated_parenthesis 0.064683 0.202667
replace_acronyms 0.070638 0.221324
```

#### pattern.en speed test

```
function                          time      frac
unidecoder                    0.000008  0.000122
```
8 changes: 4 additions & 4 deletions development/time_parsers.py
@@ -21,7 +21,7 @@
POS_Blacklist = ["connector","cardinal",
                 "pronoun","adverb",
                 "symbol","verb",
-                "punctuation","modal_verb","w_word"]
+                "punctuation",]

ABR = nlpre.identify_parenthetical_phrases()(doc2)
key0 = (('systemic', 'lupus', 'erythematosus'), 'SLE')
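For context, a blacklist like the one above is handed to the tokenizer at construction time; a minimal sketch (expected output taken from the README table):

```python
import nlpre

POS_Blacklist = ["connector", "cardinal", "pronoun", "adverb",
                 "symbol", "verb", "punctuation"]

parser = nlpre.pos_tokenizer(POS_Blacklist)
print(parser("The boy threw the ball into the yard"))
# boy ball yard
```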
@@ -39,18 +39,18 @@
parser = getattr(nlpre, key)()

if key=='unidecoder':
- func = lambda : [parser(unicode(x)) for x in [doc2]]
+ func = lambda : [parser(x) for x in [doc2]]
else:
func = lambda : [parser(x) for x in [doc2]]
cost = timeit.timeit(func, number=n) / n
item = {'function':key, "time":cost}
- print item
+ print (item)
data.append(item)
df = pd.DataFrame(data)
df = df.set_index('function').sort_values('time')
df["frac"] = df.time / df.time.sum()

- print df
+ print (df)



27 changes: 0 additions & 27 deletions fabfile.py

This file was deleted.

5 changes: 1 addition & 4 deletions nlpre/Grammars/__init__.py
@@ -1,7 +1,4 @@
from .parenthesis_nester import parenthesis_nester
from .reference_patterns import reference_patterns

- __all__ = [
-     'parenthesis_nester',
-     'reference_patterns',
- ]
+ __all__ = ["parenthesis_nester", "reference_patterns"]
13 changes: 6 additions & 7 deletions nlpre/Grammars/parenthesis_nester.py
@@ -10,26 +10,25 @@ class parenthesis_nester(object):
def __init__(self):
nest = pypar.nestedExpr
g = pypar.Forward()
- nestedParens = nest('(', ')')
- nestedBrackets = nest('[', ']')
- nestedCurlies = nest('{', '}')
+ nestedParens = nest("(", ")")
+ nestedBrackets = nest("[", "]")
+ nestedCurlies = nest("{", "}")
nest_grammar = nestedParens | nestedBrackets | nestedCurlies

parens = "(){}[]"
- letters = ''.join([x for x in pypar.printables
-                    if x not in parens])
+ letters = "".join([x for x in pypar.printables if x not in parens])
word = pypar.Word(letters)

g = pypar.OneOrMore(word | nest_grammar)
self.grammar = g

def __call__(self, line):
- '''
+ """
Args:
line: a string
Returns:
tokens: a parsed object
- '''
+ """

try:
tokens = self.grammar.parseString(line)
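A usage sketch for the class above; the nested-list output is indicative of how pyparsing's `nestedExpr` groups parenthetical content:

```python
from nlpre.Grammars import parenthesis_nester

nester = parenthesis_nester()
tokens = nester("Hello (it is a beautiful day) world.")
# Words parse flat; parenthetical content comes back as a nested group,
# e.g. ['Hello', ['it', 'is', 'a', 'beautiful', 'day'], 'world.']
print(tokens)
```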
72 changes: 37 additions & 35 deletions nlpre/Grammars/reference_patterns.py
@@ -4,54 +4,56 @@

class reference_patterns:
def __init__(self):
- real_word_dashes = Word(pyparsing.alphas + '-')
- punctuation = Word('.!?:,;-')
- punctuation_no_dash = Word('.!?:,;')
- punctuation_reference_letter = Word('.:,;-')
+ real_word_dashes = Word(pyparsing.alphas + "-")
+ punctuation = Word(".!?:,;-")
+ punctuation_no_dash = Word(".!?:,;")
+ punctuation_reference_letter = Word(".:,;-")

printable = Word(pyparsing.printables, exact=1)
letter = Word(pyparsing.alphas, exact=1)
letter_reference = punctuation_reference_letter + letter

- nums = Word(pyparsing.nums) + Optional(letter) + \
-     ZeroOrMore(letter_reference)
-
- word_end = pyparsing.ZeroOrMore(Word(')') | Word('}') | Word(']')) + \
-     WordEnd()
-
- self.single_number = (
-     WordStart() +
-     real_word_dashes +
-     nums +
-     word_end
- )
+ nums = (
+     Word(pyparsing.nums)
+     + Optional(letter)
+     + ZeroOrMore(letter_reference)
+ )
+
+ word_end = (
+     pyparsing.ZeroOrMore(Word(")") | Word("}") | Word("]"))
+     + Optional(punctuation_no_dash)
+     + WordEnd()
+ )
+
+ self.single_number = WordStart() + real_word_dashes + nums + word_end

self.single_number_parens = (
- printable +
- letter +
- Optional(punctuation_no_dash) +
- pyparsing.OneOrMore(
-     Word('([{', exact=1) +
-     pyparsing.OneOrMore(nums | Word('-')) +
-     Word(')]}', exact=1)
- ) +
- word_end
+ printable
+ + letter
+ + Optional(punctuation_no_dash)
+ + pyparsing.OneOrMore(
+     Word("([{", exact=1)
+     + pyparsing.OneOrMore(nums | Word("-"))
+     + Word(")]}", exact=1)
+ )
+ + Optional(punctuation_no_dash)
+ + word_end
)

self.number_then_punctuation = (
- printable +
- letter +
- nums +
- punctuation +
- pyparsing.ZeroOrMore(nums | punctuation) +
- word_end
+ printable
+ + letter
+ + nums
+ + punctuation
+ + pyparsing.ZeroOrMore(nums | punctuation)
+ + word_end
)

self.punctuation_then_number = (
- printable +
- letter +
- punctuation_no_dash +
- nums +
- pyparsing.ZeroOrMore(punctuation | nums) +
- word_end
+ printable
+ + letter
+ + punctuation_no_dash
+ + nums
+ + pyparsing.ZeroOrMore(punctuation | nums)
+ + word_end
)
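Each attribute above is an ordinary pyparsing expression, so it can be scanned directly. An illustrative check (the sample string is ours, and whether it matches depends on the exact grammar):

```python
from nlpre.Grammars import reference_patterns

r = reference_patterns()

# single_number targets word-dash-digit runs such as inline citations.
for tokens in r.single_number.searchString("The new treatment-12 was effective."):
    print(tokens)
```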
29 changes: 16 additions & 13 deletions nlpre/__init__.py
@@ -1,4 +1,6 @@
+ from .spacy_init import nlp
import logging

from ._version import __version__
from .replace_from_dictionary import replace_from_dictionary
from .separated_parenthesis import separated_parenthesis
@@ -14,19 +14,20 @@
from .url_replacement import url_replacement

__all__ = [
- 'separated_parenthesis',
- 'token_replacement',
- 'decaps_text',
- 'dedash',
- 'pos_tokenizer',
- 'titlecaps',
- 'replace_from_dictionary',
- 'identify_parenthetical_phrases',
- 'unidecoder',
- 'replace_acronyms',
- 'separate_reference',
- 'url_replacement',
- '__version__',
+ "separated_parenthesis",
+ "token_replacement",
+ "decaps_text",
+ "dedash",
+ "pos_tokenizer",
+ "titlecaps",
+ "replace_from_dictionary",
+ "identify_parenthetical_phrases",
+ "unidecoder",
+ "replace_acronyms",
+ "separate_reference",
+ "url_replacement",
+ "nlp",
+ "__version__",
]

logger = logging.getLogger(__name__)
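The module-level logger defined above is the one the README configures with `nlpre.logger.setLevel(logging.INFO)`; for example:

```python
import logging
import nlpre

logging.basicConfig()  # attach a root handler so log records are printed
nlpre.logger.setLevel(logging.INFO)

# Modules now report their work, e.g. decaps_text logs lines like
# "Decapitalizing word James to james".
print(nlpre.decaps_text()("James and Sally had a fMRI"))
```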
10 changes: 9 additions & 1 deletion nlpre/_version.py
@@ -1,8 +1,16 @@
# One canonical source for the version number,
# Versions should comply with PEP440.

- __version__ = "1.2.3"
+ __version__ = "2.0.0"

# 2.0.0 Major update, breaking changes! pattern.en replaced with spaCy.
# + Most spaces before terminal punc removed.
# + Support for python 2 has been dropped
# + Backend NLP engine `pattern.en` has been replaced with `spaCy`
# + Support for custom dictionaries in `replace_from_dictionary`
# + Option for suffix to be used instead of prefix in `replace_from_dictionary`
# + URL replacement can now remove emails
# + `token_replacement` can remove symbols

# 1.2.3 Fixed the version for mysqlclient to help windows installs.
# Version tracking started in 1.2.3 (add to the top)
10 changes: 5 additions & 5 deletions nlpre/decaps_text.py
@@ -18,21 +18,21 @@ def __init__(self):
self.logger = logging.getLogger(__name__)

def modify_word(self, org):
- '''
+ """
Changes a word to lower case if it contains exactly one capital letter.
Args:
org: a string
Returns:
lower: the lowercase of org, a string
- '''
+ """

lower = org.lower()

if self.diffn(org, lower) > 1:
return org
elif org != lower:
- self.logger.info('Decapitalizing word %s to %s' % (org, lower))
+ self.logger.info("Decapitalizing word %s to %s" % (org, lower))
return lower

def __call__(self, text):
@@ -52,8 +52,8 @@ def __call__(self, text):
for sent in sentences:

sent = [self.modify_word(w) for w in sent]
- doc2.append(' '.join(sent))
+ doc2.append(" ".join(sent))

- doc2 = '\n'.join(doc2)
+ doc2 = "\n".join(doc2)

return doc2
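A quick check of the class above, using the example from the project README:

```python
from nlpre import decaps_text

parser = decaps_text()
print(parser("James and Sally had a fMRI"))
# james and sally had a fMRI
```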

