Easy lightweight text classification via Naive Bayes. Give it a small set of example data, and it will classify similar inputs with blazing speed and pretty good accuracy.
Making a basic chatbot? Want to do basic auto-suggestion of misspelled commands? Don't want to bust out TensorFlow or scikit-learn? Classy's got you covered.
Classify some data (probably by hand) into a `dict`:
```python
data = {'lights': ['Could you turn my lights off?',
                   'Turn my lights off',
                   'Are my lights off?',
                   'All lights off, please',
                   'Turn some lights on',
                   'Which bulbs are on?'],
        'alarm': ['Set an alarm for tomorrow at 6:00',
                  'What time is my alarm?',
                  'When will I wake up tomorrow?',
                  'What time is wakeup tomorrow?']}
```
Create the `Classifier` object:
```python
import classy

c = classy.Classifier(data)
```
To classify text, simply use `.classify()`:
```python
>>> c.classify('Which of my lights are off?')
{'lights': 0.9981515711645101, 'alarm': 0.0018484288354898338}
```
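Under the hood this is just Bayes' rule over word counts. Here's a minimal sketch of that computation (a toy illustration with hypothetical tokenization and add-one smoothing, not Classy's actual internals, so it won't reproduce the exact numbers above):

```python
import math
from collections import Counter

def naive_bayes(data, text):
    # Toy multinomial Naive Bayes with add-one (Laplace) smoothing.
    # A sketch of the idea only -- not Classy's real implementation.
    tokens = text.lower().split()
    vocab = {t for exs in data.values() for ex in exs for t in ex.lower().split()}
    logscores = {}
    for label, exs in data.items():
        counts = Counter(t for ex in exs for t in ex.lower().split())
        total = sum(counts.values())
        # unnormalized log-prior (the constant factor cancels when normalizing below)
        score = math.log(len(exs))
        for t in tokens:
            score += math.log((counts[t] + 1) / (total + len(vocab)))
        logscores[label] = score
    # convert log-scores back into probabilities that sum to 1
    m = max(logscores.values())
    exps = {k: math.exp(v - m) for k, v in logscores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}
```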
If you wish to receive only a single classified label, rather than a full `dict` of probabilities, set the `.threshold` property. The `.classify()` method will then return the label of the matched class, or `None` if all probabilities fall below the threshold (i.e., if the classifier is uncertain). Behavior for `.threshold` values <= 0.5 is undefined.
```python
>>> c = classy.Classifier(data)
>>> c.threshold = 0.9
>>> c.classify('Which of my lights are off?')
'lights'
>>> print(c.classify("Some words we've never seen before"))
None
```
Classy by default performs minimal preprocessing of incoming text, equivalent to:
```python
import re

def parse(text):
    # makes all uppercase characters lowercase
    text = text.lower()
    # removes all except alphanumerics and spaces
    text = re.sub(r'[^a-z0-9 ]', r'', text)
    # splits by spaces, and discards all empty strings
    return [i for i in text.split(' ') if i != '']
```
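For example:

```python
>>> parse('Could you turn my lights off?')
['could', 'you', 'turn', 'my', 'lights', 'off']
```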
If you wish to supply a custom string parsing function, simply provide it as the `f` argument when creating a `Classifier` object:
```python
def newParse(t):
    return [i for i in t.split(',') if i != '']

c = classy.Classifier(data, f=newParse)
```
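With `newParse`, each comma-separated field becomes a single token, so whole training phrases carry the match. A hypothetical call (exact probabilities depend on Classy's internals):

```python
# 'Turn my lights off' is a verbatim token from the 'lights' training data,
# so the returned probabilities should lean heavily toward 'lights'
result = c.classify('Turn my lights off,what time is it')
```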
Want a bare-bones spellchecker over a (very) limited set of inputs?
```python
def allSubsets(text):
    # enumerate every contiguous substring of the input
    temp = []
    for a in range(len(text)):
        for b in range(a, len(text)):
            temp.append(text[a:b + 1])
    return temp
```
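For instance:

```python
>>> allSubsets('pull')
['p', 'pu', 'pul', 'pull', 'u', 'ul', 'ull', 'l', 'll', 'l']
```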
```python
>>> data = {'push': ['push'],
...         'commit': ['commit'],
...         'pull': ['pull'],
...         'diff': ['diff']}
>>> c = classy.Classifier(data, f=allSubsets, threshold=0.6)
>>> c.classify('commot')
'commit'
>>> c.classify('cpmmot')
'commit'
>>> c.classify('pulll')
'pull'
>>> c.classify('diffg')
'diff'
```
The `allSubsets` function enumerates all substrings of its input, which does a reasonably decent job of spellchecking when paired with Naive Bayes. This usage should be considered a ~~cool trick~~ temporary hackjob, as there are many, many better ways to do this (Levenshtein distance, for example).
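For comparison, here's a minimal sketch of the Levenshtein approach (plain Python, independent of Classy; `suggest` and its command list are made up for this example):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(word, commands=('push', 'commit', 'pull', 'diff')):
    # pick the known command with the smallest edit distance
    return min(commands, key=lambda c: levenshtein(word, c))

# e.g. suggest('commot') -> 'commit', suggest('pulll') -> 'pull'
```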