
Merge pull request #110 from NIHOPA/pattern_to_spacy
Pattern to spaCy
thoppe authored Mar 19, 2019
2 parents 720dfc1 + bf9bf5a commit 8654343
Showing 84 changed files with 82,478 additions and 1,196 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
@@ -1,7 +1,7 @@
sudo: false
language: python
python:
- - "2.7"
+ - "3.6"

install:
- pip install -r requirements.txt
11 changes: 10 additions & 1 deletion README.md
@@ -5,6 +5,15 @@
[![PyPI](https://img.shields.io/pypi/v/nlpre.svg)](https://pypi.python.org/pypi/nlpre)
[![PyVersion](https://img.shields.io/pypi/pyversions/nlpre.svg)](https://img.shields.io/pypi/pyversions/nlpre.svg)

## Major version update! NLPre 2.0.0

+ The backend NLP engine `pattern.en` has been replaced with `spaCy` v2.1.0. This is a major fix for several problems with `pattern.en`, including poor lemmatization (e.g. cytokine -> cytocow)
+ Support for Python 2 has been dropped
+ Support for custom dictionaries in `replace_from_dictionary`
+ Option for suffix to be used instead of prefix in `replace_from_dictionary`
+ URL replacement can now remove emails
+ `token_replacement` can remove symbols

NLPre is a text (pre)-processing library that helps smooth some of the inconsistencies found in real-world data.
Correcting for issues like random capitalization patterns, strange hyphenations, and abbreviations is an essential part of wrangling textual data but is often left to the user.
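A minimal sketch of the new options above; the keyword names (`suffix=`) are assumptions for illustration, not confirmed signatures, so check the module docstrings for the exact names:

```python
import nlpre

# `suffix=` is an assumed keyword for the new suffix option in
# replace_from_dictionary; the exact name may differ.
parser = nlpre.replace_from_dictionary(suffix="_MeSH")
print(parser("The patient presented with a heart attack."))

# URL replacement can now scrub email addresses as well as URLs.
print(nlpre.url_replacement()("Write to someone@example.com for details."))
```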

@@ -64,7 +73,7 @@ nlpre.logger.setLevel(logging.INFO)
| [**replace_acronyms**](nlpre/replace_acronyms.py) | Replaces acronyms and abbreviations found in a document with their corresponding phrase. If an acronym is explicitly identified with a phrase in a document, then all instances of that acronym in the document will be replaced with the given phrase. If there is no explicit indication what the phrase is within the document, then the most common phrase associated with the acronym in the given counter is used. <br> `The EPA protects trees` <br> `The Environmental_Protection_Agency protects trees`
| [**identify_parenthetical_phrases**](nlpre/identify_parenthetical_phrases.py) | Identifies abbreviations of phrases found in a parenthesis. Returns a counter and can be passed directly into [`replace_acronyms`](nlpre/replace_acronyms). <br> `Environmental Protection Agency (EPA)` <br> `Counter((('Environmental', 'Protection', 'Agency'), 'EPA'):1)` |
| [**separated_parenthesis**](nlpre/separated_parenthesis.py) | Separates parenthetical content into new sentences. This is useful when creating word embeddings, as associations should only be made within the same sentence. Terminal punctuation of a period is added to parenthetical sentences if necessary. <br> `Hello (it is a beautiful day) world.` <br>`Hello world. it is a beautiful day .` |
- | [**pos_tokenizer**](nlpre/pos_tokenizer.py) | Removes all words that are of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the [`pattern.en.parse`](http://www.clips.ua.ac.be/pages/pattern-en#parser) module. <br> `The boy threw the ball into the yard` <br> `boy ball yard` |
+ | [**pos_tokenizer**](nlpre/pos_tokenizer.py) | Removes all words that are of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the [`spaCy`](https://spacy.io/) module. <br> `The boy threw the ball into the yard` <br> `boy ball yard` |
| [**unidecoder**](nlpre/unidecoder.py) | Converts Unicode phrases into ASCII equivalent. <br> `α-Helix β-sheet` <br> `a-Helix b-sheet` |
| [**dedash**](nlpre/dedash.py) | Hyphenations are sometimes erroneously inserted when text is passed through a word-processor. This module attempts to correct the hyphenation pattern by joining split words when the joined form appears in an English word list. <br> `How is the treat- ment going` <br> `How is the treatment going` |
| [**decaps_text**](nlpre/decaps_text.py) | We presume that case is important, but only when it differs from title case. This class normalizes capitalization patterns. <br> `James and Sally had a fMRI` <br> `james and sally had a fMRI` |
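The modules above are plain callables that compose into a pipeline; a short illustrative chain (the sample text is ours):

```python
import nlpre

text = "The Environmental Protection Agency (EPA) regulates pesticides."

# Harvest (phrase, acronym) pairs, then expand the acronyms in place.
counter = nlpre.identify_parenthetical_phrases()(text)
pipeline = [
    nlpre.unidecoder(),
    nlpre.dedash(),
    nlpre.separated_parenthesis(),
    nlpre.replace_acronyms(counter),
]
for step in pipeline:
    text = step(text)
print(text)
```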
19 changes: 18 additions & 1 deletion development/README.md
@@ -1,4 +1,21 @@
- ## Current speed test
+ ## Current speed test (spaCy 2.1.0)

```
time frac
function
unidecoder 0.000002 0.000005
token_replacement 0.000038 0.000121
dedash 0.000354 0.001111
replace_from_dictionary 0.000584 0.001831
identify_parenthetical_phrases 0.005599 0.017542
titlecaps 0.058357 0.182846
pos_tokenizer 0.058928 0.184635
decaps_text 0.059976 0.187919
separated_parenthesis 0.064683 0.202667
replace_acronyms 0.070638 0.221324
```

#### pattern.en speed test

```
function                          time      frac
unidecoder                    0.000008  0.000122
```
8 changes: 4 additions & 4 deletions development/time_parsers.py
@@ -21,7 +21,7 @@
POS_Blacklist = ["connector","cardinal",
                 "pronoun","adverb",
                 "symbol","verb",
-                "punctuation","modal_verb","w_word"]
+                "punctuation",]

ABR = nlpre.identify_parenthetical_phrases()(doc2)
key0 = (('systemic', 'lupus', 'erythematosus'), 'SLE')
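For context, a blacklist like the one above is handed to the tokenizer at construction time; a minimal sketch (expected output taken from the README table):

```python
import nlpre

POS_Blacklist = ["connector", "cardinal", "pronoun", "adverb",
                 "symbol", "verb", "punctuation"]

parser = nlpre.pos_tokenizer(POS_Blacklist)
print(parser("The boy threw the ball into the yard"))
# boy ball yard
```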
@@ -39,18 +39,18 @@
parser = getattr(nlpre, key)()

if key=='unidecoder':
- func = lambda : [parser(unicode(x)) for x in [doc2]]
+ func = lambda : [parser(x) for x in [doc2]]
else:
func = lambda : [parser(x) for x in [doc2]]
cost = timeit.timeit(func, number=n) / n
item = {'function':key, "time":cost}
- print item
+ print (item)
data.append(item)
df = pd.DataFrame(data)
df = df.set_index('function').sort_values('time')
df["frac"] = df.time / df.time.sum()

- print df
+ print (df)



27 changes: 0 additions & 27 deletions fabfile.py

This file was deleted.

5 changes: 1 addition & 4 deletions nlpre/Grammars/__init__.py
@@ -1,7 +1,4 @@
from .parenthesis_nester import parenthesis_nester
from .reference_patterns import reference_patterns

- __all__ = [
-     'parenthesis_nester',
-     'reference_patterns',
- ]
+ __all__ = ["parenthesis_nester", "reference_patterns"]
13 changes: 6 additions & 7 deletions nlpre/Grammars/parenthesis_nester.py
@@ -10,26 +10,25 @@ class parenthesis_nester(object):
def __init__(self):
nest = pypar.nestedExpr
g = pypar.Forward()
- nestedParens = nest('(', ')')
- nestedBrackets = nest('[', ']')
- nestedCurlies = nest('{', '}')
+ nestedParens = nest("(", ")")
+ nestedBrackets = nest("[", "]")
+ nestedCurlies = nest("{", "}")
nest_grammar = nestedParens | nestedBrackets | nestedCurlies

parens = "(){}[]"
- letters = ''.join([x for x in pypar.printables
-                    if x not in parens])
+ letters = "".join([x for x in pypar.printables if x not in parens])
word = pypar.Word(letters)

g = pypar.OneOrMore(word | nest_grammar)
self.grammar = g

def __call__(self, line):
- '''
+ """
Args:
line: a string
Returns:
tokens: a parsed object
- '''
+ """

try:
tokens = self.grammar.parseString(line)
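A usage sketch for the class above; the nested-list output is indicative of how pyparsing's `nestedExpr` groups parenthetical content:

```python
from nlpre.Grammars import parenthesis_nester

nester = parenthesis_nester()
tokens = nester("Hello (it is a beautiful day) world.")
# Words parse flat; parenthetical content comes back as a nested group,
# e.g. ['Hello', ['it', 'is', 'a', 'beautiful', 'day'], 'world.']
print(tokens)
```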
72 changes: 37 additions & 35 deletions nlpre/Grammars/reference_patterns.py
@@ -4,54 +4,56 @@

class reference_patterns:
def __init__(self):
- real_word_dashes = Word(pyparsing.alphas + '-')
- punctuation = Word('.!?:,;-')
- punctuation_no_dash = Word('.!?:,;')
- punctuation_reference_letter = Word('.:,;-')
+ real_word_dashes = Word(pyparsing.alphas + "-")
+ punctuation = Word(".!?:,;-")
+ punctuation_no_dash = Word(".!?:,;")
+ punctuation_reference_letter = Word(".:,;-")

printable = Word(pyparsing.printables, exact=1)
letter = Word(pyparsing.alphas, exact=1)
letter_reference = punctuation_reference_letter + letter

- nums = Word(pyparsing.nums) + Optional(letter) + \
-     ZeroOrMore(letter_reference)
-
- word_end = pyparsing.ZeroOrMore(Word(')') | Word('}') | Word(']')) + \
-     WordEnd()
-
- self.single_number = (
-     WordStart() +
-     real_word_dashes +
-     nums +
-     word_end
- )
+ nums = (
+     Word(pyparsing.nums)
+     + Optional(letter)
+     + ZeroOrMore(letter_reference)
+ )
+
+ word_end = (
+     pyparsing.ZeroOrMore(Word(")") | Word("}") | Word("]"))
+     + Optional(punctuation_no_dash)
+     + WordEnd()
+ )
+
+ self.single_number = WordStart() + real_word_dashes + nums + word_end

self.single_number_parens = (
- printable +
- letter +
- Optional(punctuation_no_dash) +
- pyparsing.OneOrMore(
-     Word('([{', exact=1) +
-     pyparsing.OneOrMore(nums | Word('-')) +
-     Word(')]}', exact=1)
- ) +
- word_end
+ printable
+ + letter
+ + Optional(punctuation_no_dash)
+ + pyparsing.OneOrMore(
+     Word("([{", exact=1)
+     + pyparsing.OneOrMore(nums | Word("-"))
+     + Word(")]}", exact=1)
+ )
+ + Optional(punctuation_no_dash)
+ + word_end
)

self.number_then_punctuation = (
- printable +
- letter +
- nums +
- punctuation +
- pyparsing.ZeroOrMore(nums | punctuation) +
- word_end
+ printable
+ + letter
+ + nums
+ + punctuation
+ + pyparsing.ZeroOrMore(nums | punctuation)
+ + word_end
)

self.punctuation_then_number = (
- printable +
- letter +
- punctuation_no_dash +
- nums +
- pyparsing.ZeroOrMore(punctuation | nums) +
- word_end
+ printable
+ + letter
+ + punctuation_no_dash
+ + nums
+ + pyparsing.ZeroOrMore(punctuation | nums)
+ + word_end
)
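Each attribute above is an ordinary pyparsing expression, so it can be scanned directly. An illustrative check (the sample string is ours, and whether it matches depends on the exact grammar):

```python
from nlpre.Grammars import reference_patterns

r = reference_patterns()

# single_number targets word-dash-digit runs such as inline citations.
for tokens in r.single_number.searchString("The new treatment-12 was effective."):
    print(tokens)
```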
29 changes: 16 additions & 13 deletions nlpre/__init__.py
@@ -1,4 +1,6 @@
+ from .spacy_init import nlp
import logging

from ._version import __version__
from .replace_from_dictionary import replace_from_dictionary
from .separated_parenthesis import separated_parenthesis
@@ -14,19 +14,20 @@
from .url_replacement import url_replacement

__all__ = [
- 'separated_parenthesis',
- 'token_replacement',
- 'decaps_text',
- 'dedash',
- 'pos_tokenizer',
- 'titlecaps',
- 'replace_from_dictionary',
- 'identify_parenthetical_phrases',
- 'unidecoder',
- 'replace_acronyms',
- 'separate_reference',
- 'url_replacement',
- '__version__',
+ "separated_parenthesis",
+ "token_replacement",
+ "decaps_text",
+ "dedash",
+ "pos_tokenizer",
+ "titlecaps",
+ "replace_from_dictionary",
+ "identify_parenthetical_phrases",
+ "unidecoder",
+ "replace_acronyms",
+ "separate_reference",
+ "url_replacement",
+ "nlp",
+ "__version__",
]

logger = logging.getLogger(__name__)
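The module-level logger defined above is the one the README configures with `nlpre.logger.setLevel(logging.INFO)`; for example:

```python
import logging
import nlpre

logging.basicConfig()  # attach a root handler so log records are printed
nlpre.logger.setLevel(logging.INFO)

# Modules now report their work, e.g. decaps_text logs lines like
# "Decapitalizing word James to james".
print(nlpre.decaps_text()("James and Sally had a fMRI"))
```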
10 changes: 9 additions & 1 deletion nlpre/_version.py
@@ -1,8 +1,16 @@
# One canonical source for the version number,
# Versions should comply with PEP440.

- __version__ = "1.2.3"
+ __version__ = "2.0.0"

# 2.0.0 Major update, breaking changes! pattern.en replaced with spaCy.
# + Most spaces before terminal punc removed.
# + Support for python 2 has been dropped
# + Backend NLP engine `pattern.en` has been replaced with `spaCy`
# + Support for custom dictionaries in `replace_from_dictionary`
# + Option for suffix to be used instead of prefix in `replace_from_dictionary`
# + URL replacement can now remove emails
# + `token_replacement` can remove symbols

# 1.2.3 Fixed the version for mysqlclient to help windows installs.
# Version tracking started in 1.2.3 (add to the top)
10 changes: 5 additions & 5 deletions nlpre/decaps_text.py
@@ -18,21 +18,21 @@ def __init__(self):
self.logger = logging.getLogger(__name__)

def modify_word(self, org):
- '''
+ """
Changes a word to lower case if it contains exactly one capital letter.
Args:
org: a string
Returns:
lower: the lowercase of org, a string
- '''
+ """

lower = org.lower()

if self.diffn(org, lower) > 1:
return org
elif org != lower:
- self.logger.info('Decapitalizing word %s to %s' % (org, lower))
+ self.logger.info("Decapitalizing word %s to %s" % (org, lower))
return lower

def __call__(self, text):
@@ -52,8 +52,8 @@ def __call__(self, text):
for sent in sentences:

sent = [self.modify_word(w) for w in sent]
- doc2.append(' '.join(sent))
+ doc2.append(" ".join(sent))

- doc2 = '\n'.join(doc2)
+ doc2 = "\n".join(doc2)

return doc2
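A quick check of the class above, using the example from the project README:

```python
from nlpre import decaps_text

parser = decaps_text()
print(parser("James and Sally had a fMRI"))
# james and sally had a fMRI
```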

