Update the call to canonical_target_name() in parse_all.py #26

Open
stevenlujpl opened this issue Jun 22, 2021 · 2 comments
@stevenlujpl

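The canonical_target_name() function in utils.py takes four required parameters (name, id, targets, aliases), but the call in parse_all.py passes only one. Both excerpts are shown below: first the call site in parse_all.py, then the function definition in utils.py.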

cont = {
    'label': 'contains',  # also stored as 'type'
    # target_names (list), cont_names (list)
    'target_names': [canonical_target_name(ex[0]['word'])],
    'cont_names': [canonical_name(ex[1]['word'])],
    # target_ids (list), cont_ids (list)
    # - p_id prepended in indexer.py
    'target_ids': ['%s_%d_%d' % (ex[0]['ner'].lower(),
                                 ex[0]['characterOffsetBegin'],
                                 ex[0]['characterOffsetEnd'])],
    'cont_ids': ['%s_%d_%d' % (ex[1]['ner'].lower(),
                               ex[1]['characterOffsetBegin'],
                               ex[1]['characterOffsetEnd'])],
    # excerpt_t (sentence)
    'sentence': ' '.join([t['originalText'] for
                          t in ex[2]['tokens']]),
    # source: 'corenlp' (later, change to 'jsre')
    'source': 'corenlp',
}

def canonical_target_name(name, id, targets, aliases):
    """
    Gets canonical target name
    :param name - name whose canonical name is to be looked up
    :return canonical name
    """
    name = name.strip()
    # Look up 'name' in the aliases; if found, replace with its antecedent
    # Note: this is super permissive.  Exact match on id is safe,
    # but we're also allowing any exact-text match with any other
    # known target name.
    all_targets = [t['annotation_id_s'] for t in targets
                   if t['name'] == name]
    name_aliases = [a['arg2_s'] for a in aliases
                    if ((a['arg1_s'] == id) or
                        (a['arg1_s'] in all_targets))]
    if len(name_aliases) > 0:
        # Ideally there is only one; let's use the first one
        can_name = [t['name'] for t in targets
                    if t['annotation_id_s'] == name_aliases[0]]
        print('Mapping <%s> to <%s>' % (name, can_name[0]))
        name = can_name[0]
    return re.sub(r"[\s_-]+", " ", name).title().replace(' ', '_')
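To make the mismatch concrete, here is a minimal sketch that copies the function quoted above and calls it the way parse_all.py does. It adds `import re`, which the excerpt omits (presumably utils.py imports it at module level). The sample names are made up for illustration. Because id, targets, and aliases have no default values, the one-argument call raises a TypeError before the function body runs; passing empty lists for targets and aliases reduces the function to its final normalization step.

```python
import re


def canonical_target_name(name, id, targets, aliases):
    """Copy of the utils.py function quoted above, for illustration only."""
    name = name.strip()
    all_targets = [t['annotation_id_s'] for t in targets
                   if t['name'] == name]
    name_aliases = [a['arg2_s'] for a in aliases
                    if ((a['arg1_s'] == id) or
                        (a['arg1_s'] in all_targets))]
    if len(name_aliases) > 0:
        can_name = [t['name'] for t in targets
                    if t['annotation_id_s'] == name_aliases[0]]
        print('Mapping <%s> to <%s>' % (name, can_name[0]))
        name = can_name[0]
    return re.sub(r"[\s_-]+", " ", name).title().replace(' ', '_')


# The one-argument call, as in parse_all.py, fails with a TypeError
# because three required positional arguments are missing.
try:
    canonical_target_name('windjana')  # made-up sample name
    one_arg_error = None
except TypeError as err:
    one_arg_error = err

# With no targets or aliases, only the normalization step applies:
# whitespace, underscores, and hyphens collapse to single underscores,
# and each word is title-cased.
normalized = canonical_target_name(' big-sky_rock ', None, [], [])
# normalized == 'Big_Sky_Rock'
```

Passing `None, [], []` would silence the error, but whether that is the right fix depends on whether parse_all.py is meant to resolve aliases at all.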

@stevenlujpl stevenlujpl self-assigned this Jun 22, 2021
@stevenlujpl

@wkiri It seems there are two canonical_target_name() functions. One is in the utils.py script of the parser-indexer repo, and the other one is in the name_utils.py of the MTE repo.

The function in the MTE repo is easy to follow, but I am not sure I fully understand the intent of the function in the parser-indexer repo. I also don't see how to prepare the inputs needed to call the parser-indexer version (specifically the targets and aliases parameters).

I think the function in the parser-indexer repo may be outdated and should be replaced with the one from the MTE repo. Could you please help take a look?

@wkiri

wkiri commented Aug 18, 2021

@stevenlujpl It looks to me like the additional arguments were added to allow target matching when known aliases were present. It seems that this is only used by (and was probably motivated by) brat_ann_indexer.py, which reads .ann files and stores them in Solr. Since we are no longer using Solr, I think this entire file is deprecated - for MTE purposes at least.
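As a rough illustration of the alias matching described here, the sketch below isolates the lookup step of the parser-indexer function. The shape of targets and aliases is inferred only from the keys that function reads (annotation_id_s, name, arg1_s, arg2_s); the helper name `resolve_alias`, the sample records, and the IDs are all hypothetical, and the real records handled by brat_ann_indexer.py may carry other fields.

```python
def resolve_alias(name, id, targets, aliases):
    """Hypothetical helper mirroring the alias lookup in canonical_target_name()."""
    # IDs of known targets whose text exactly matches `name`.
    matching_ids = [t['annotation_id_s'] for t in targets
                    if t['name'] == name]
    # Antecedent IDs (arg2_s) of aliases whose source (arg1_s) is either
    # this annotation's id or any target with the same text.
    antecedents = [a['arg2_s'] for a in aliases
                   if a['arg1_s'] == id or a['arg1_s'] in matching_ids]
    if not antecedents:
        return name
    # As in the original, use the first antecedent found.
    for t in targets:
        if t['annotation_id_s'] == antecedents[0]:
            return t['name']
    # Antecedent missing from targets: the original would raise IndexError
    # here; this sketch falls back to the unchanged name instead.
    return name


# Made-up sample data: 'WJ' is annotated as an alias of 'Windjana'.
targets = [
    {'annotation_id_s': 'doc1_T1', 'name': 'Windjana'},
    {'annotation_id_s': 'doc1_T2', 'name': 'WJ'},
]
aliases = [{'arg1_s': 'doc1_T2', 'arg2_s': 'doc1_T1'}]

by_id = resolve_alias('WJ', 'doc1_T2', targets, aliases)      # exact id match
by_text = resolve_alias('WJ', 'other_id', targets, aliases)   # exact-text match
unknown = resolve_alias('Stephen', 'doc1_T9', targets, aliases)  # no alias
```

Here both `by_id` and `by_text` resolve to 'Windjana', which shows the permissive text match noted in the function's own comment, while `unknown` is returned unchanged.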

This raises a larger question. The same comment about outdated Solr capabilities applies to csvindexer.py, indexer.py, and solr.py, all of which were set up with Solr infrastructure. Now that we've moved to SQLite, perhaps we should transition (copy?) the current *_parser.py files and json2brat.py back into the main MTE repository and remove the dependency on the parser-indexer repository. These files output JSON, without Solr involved. This would also allow the parsing files to access the same name_utils.py that is in the MTE repo without duplication. Please share your thoughts on this.
