XX[OLD] Normalize Institution Names
When searching the PublicationQuery::TextSearch data, which could have Affiliation in the AggregateText data, is there a controlled vocabulary that can be used to better match institution names?
For example, is "Stanford University" more accurate than just "Stanford"? (For testing, this could be useful, but for the actual application, the logic behind using "Stanford" instead of "Stanford University" is to catch other institutions affiliated with Stanford that are not the university itself, such as the hospital, cancer institute, woods institute, etc.)
The current PublicationQuery search includes Stanford in the TextSearch criterion (using an "and" conjunction with the author name and an ExactMatch search type), see
For any matches, when a PublicationItemId is resolved to get the PublicationItem, it’s curious that the PublicationItem returned has no author affiliation data in any of its fields (see pp. 34-36 of the SW-API docs).
The SW-API documentation for PublicationQuery::TextSearch indicates that it searches all AggregateText fields of a publication (see pp. 25-26). This includes: PMID, Abstract, Title, AuthorList, KeywordList, Affiliation*, PublicationSourceTitle, PublicationSubjectCategoryList, AuthorEmailAddresses, MeSHTerms*, ChemicalSubstanceNames*, DOI. (The * fields "are not part of the queryable API and therefore not included in the PublicationItems list.")
There is no specific information in the SW-API docs on how to limit an "institution" match against author affiliations, and such data are not returned in a PublicationItem document. As it is now, the institution name (e.g. Stanford) could match anything in the AggregateText, not just author affiliation. There are specific filters for author name parts (first, middle and last name), but not for author affiliation. The institution name, like Stanford, could match an email address because the AggregateText includes the AuthorEmailAddresses field.
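To make the concern concrete, here is a minimal sketch of how a bare institution term can hit a field other than the affiliation. The record contents and the `matching_fields` helper are hypothetical and invented for illustration; the real AggregateText is searched on the ScienceWire side, and a case-insensitive substring match is only an assumption about how it behaves.

```ruby
# Return the names of the fields whose text contains the term.
# Assumption: a case-insensitive substring match stands in for the
# real (server-side) TextSearch behaviour.
def matching_fields(record, term)
  needle = term.downcase
  record.select { |_field, text| text.downcase.include?(needle) }.keys
end

# Hypothetical record: the field names follow the AggregateText list,
# but every value here is made up.
record = {
  'Title' => 'Coral reef recovery after bleaching',
  'AuthorList' => 'Doe, Jane; Roe, Richard',
  'Affiliation' => 'Dept. of Biology, Univ. of Hawaii',
  'AuthorEmailAddresses' => 'jdoe@stanford.example'
}

matching_fields(record, 'Stanford') # => ["AuthorEmailAddresses"]
```

In this made-up record the term matches only through the email address, even though the author affiliation is a different institution.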
See also issues #232, #250, #285 and #288.
For the institution name, we tried a few things and found that a normalized string might work, something like this:
    # Normalize the institution by removing some common name elements that do
    # nothing to distinguish the institution.
    def normalize_institution(institution)
      exclude_words = %w[university institute organization corporation and the of]
      # Match whole words only, so "and" is not stripped out of "Portland".
      exclude_regex = /\b(?:#{exclude_words.join('|')})\b/i
      # Non-bang methods leave the caller's string unmodified.
      institution.gsub(exclude_regex, '')
                 .gsub(/\s+/, ' ')
                 .strip
                 .downcase # the search is not case sensitive
    end
In #250, we concluded that any data with institution="all" should be skipped.
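A minimal sketch of that skip rule follows. The helper name `institution_search_term` is hypothetical, and treating "all" case-insensitively and skipping blank values are assumptions that go beyond what #250 states.

```ruby
# Hypothetical helper: return the institution term to add to the
# query, or nil when the institution should be left out of the search.
def institution_search_term(institution)
  term = institution.to_s.strip
  # Assumption: "all" (in any letter case) and blank values are skipped.
  return nil if term.empty? || term.casecmp?('all')

  term
end

institution_search_term('all')      # => nil
institution_search_term('Stanford') # => "Stanford"
```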
If the normalized institution form works better, is that because it can be matched against an email domain? See #285 and the notes below on using acronyms (which are often used in email/web domains).
There is an interesting note about email data in the SW service: in the smart search notes in the SW-API docs (p. 8), it states that Medline and ScienceWire began to collect email addresses in 1996 and 1997, respectively. This is a clue to the effectiveness of emails in the smart search and possibly a reason why the smart search is not currently used without any seed publications, even if it could be used with only an email address. This is also a clue to how effective email could be for the PublicationQuery search.
With regard to controlled vocabulary, there are helpful comments from Grace (April 4, 2016) noted in consul, i.e.:
- "It is my understanding that ScienceWire is only searching the address field, not the Organization-Enhanced field (which I tend to use when I am troubleshooting because it is supposed to cluster variants together). Please note that institute/dept names are all abbreviated so it is safest to search only Stanford. Looks like to be comprehensive we would need to search Stanford as an address plus Stanford University in the Organizations-Enhanced index to retrieve citations where Stanford wasn't used, only the name of the institute or place instead (e.g. SUMC for Stanford University Medical School)."
- "institute/dept names are all abbreviated"
- SUMC for Stanford University Medical School
- This is consistent with some test queries; we found that UCSB returned more accurate results than University of California, Santa Barbara. We also found that ucsb is as effective as UCSB.
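Since acronyms looked effective in those test queries, one option is to derive a candidate acronym from the full institution name. The sketch below is only an illustration: `institution_acronym` and its stop-word list are hypothetical, and real abbreviations in the data will not always follow a simple initials rule.

```ruby
# Hypothetical helper: build an acronym from the initials of the
# significant words in an institution name.
def institution_acronym(name)
  # Assumed, incomplete list of words that do not contribute an initial.
  stop_words = %w[of the and at for]
  words = name.scan(/[[:alpha:]]+/)
  significant = words.reject { |word| stop_words.include?(word.downcase) }
  significant.map { |word| word[0].upcase }.join
end

institution_acronym('University of California, Santa Barbara') # => "UCSB"
```

Because the search did not appear to be case sensitive, the result could be used as-is or downcased.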
We also found a project doing some university authority work at
The school names feature looks useful, e.g.
    Swot::school_name '[email protected]'
    # => "University of Strathclyde"
    Swot::school_name 'http://www.stanford.edu'
    # => "Stanford University"
Hypothetically, if we have an alt-name email address, we could use swot to get the name of the institution (or the CAP UI could do this).
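As a rough, swot-free illustration of the same idea, the sketch below pulls a short search term out of an email domain, which ties in with the earlier question about matching email domains. The helper `institution_from_email` is hypothetical, and it assumes a simple suffix such as .edu; multi-part suffixes such as .ac.uk are not handled, which is the kind of case a curated list like swot's covers.

```ruby
# Hypothetical helper: return the label just before the top-level
# domain of an email address, or nil if no usable domain is found.
# Assumption: single-label suffixes only (e.g. ".edu", not ".ac.uk").
def institution_from_email(email)
  domain = email.to_s.strip.downcase[/@([a-z0-9.-]+)\z/, 1]
  return nil unless domain

  labels = domain.split('.')
  return nil if labels.size < 2

  labels[-2]
end

institution_from_email('jdoe@cs.stanford.edu') # => "stanford"
```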