
XX[OLD] Normalize Institution Names

Peter Mangiafico edited this page Apr 13, 2018 · 1 revision

WARNING: THIS PAGE HAS NOT BEEN UPDATED FOR THE WOS API BUT MAY STILL CONTAIN USEFUL INFORMATION

Normalize Institution Names

When searching the PublicationQuery::TextSearch data, which could have Affiliation in the AggregateText data, is there a controlled vocabulary that can be used to better match institution names?

For example, is "Stanford University" more accurate than just "Stanford"? (For testing, this could be useful, but for the actual application, the logic behind using "Stanford" instead of "Stanford University" is to catch other institutions affiliated with Stanford that are not the university itself, such as the hospital, cancer institute, woods institute, etc.)

The current PublicationQuery search includes Stanford in the TextSearch criterion (using an and conjunction with the author name and an ExactMatch search type), see

For any matches, when a PublicationItemId is resolved to get the PublicationItem, it is curious that the PublicationItem returned has no author affiliation data in any of its fields (see pp. 34-36 of the SW-API docs).

The SW-API documentation for PublicationQuery::TextSearch indicates that it searches all AggregateText fields of a publication (see pp. 25-26). This includes: PMID, Abstract, Title, AuthorList, KeywordList, Affiliation*, PublicationSourceTitle, PublicationSubjectCategoryList, AuthorEmailAddresses, MeSHTerms*, ChemicalSubstanceNames*, DOI. (The * fields "are not part of the queryable API and therefore not included in the PublicationItems list.")

There is no specific information in the SW-API docs on how to limit an "institution" match against author affiliations and such data are not returned in a PublicationItem document. As it is now, the institution name (e.g. Stanford) could match anything in the AggregateText, not just author affiliation. There are specific filters for author name parts (first, middle and last name), but not for author affiliation. The institution name, like Stanford, could match an email address because the AggregateText includes the AuthorEmailAddresses field.
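To make the concern concrete, here is a minimal sketch of how a bare term can match a field other than the affiliation. The record is hypothetical (field names follow the AggregateText list above, the values are made up), and `fields_matching` is an illustrative helper, not part of the SW-API; how ScienceWire actually matches text is not documented here.

```ruby
# Hypothetical record: field names come from the AggregateText list,
# the values are invented for illustration only.
record = {
  'Title' => 'A survey of protein folding',
  'AuthorList' => 'Smith, J; Lee, K',
  'Affiliation' => 'Univ Calif Santa Barbara, Dept Chem',
  'AuthorEmailAddresses' => 'k.lee@stanford.example'
}

# Return the names of the fields whose text contains the term,
# ignoring case. This only mimics a free-text match locally.
def fields_matching(record, term)
  pattern = /#{Regexp.escape(term)}/i
  record.select { |_field, text| text.match?(pattern) }.keys
end

fields_matching(record, 'Stanford')
# => ["AuthorEmailAddresses"]
```

In this made-up record the only hit is the email address, even though the affiliation is a different institution.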

See also issues #232, #250, #285 and #288.

Initial code to normalize institution name

For the institution name, we tried a few things and found that a normalized string might work, something like this:

# Normalize the institution by removing some common name elements that do
# nothing to distinguish the institution.
def normalize_institution(institution)
  exclude_words = %w[university institute organization corporation and the of]
  # Match whole words only, so "the" is not stripped out of "Netherlands".
  exclude_regex = /\b(?:#{exclude_words.join('|')})\b/i
  # Non-mutating calls leave the caller's string untouched.
  # Downcase because the search is not case sensitive.
  institution.gsub(exclude_regex, '').gsub(/\s+/, ' ').strip.downcase
end

In #250, we concluded that any data with institution="all" should be skipped.
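A minimal sketch of that skip rule follows. The helper name `searchable_institution?` is hypothetical; treating blank values as unsearchable and comparing "all" without regard to case are assumptions, not conclusions from #250.

```ruby
# Hypothetical guard: decide whether an institution value is worth
# adding to the search. "all" is skipped per #250; skipping blank
# values and ignoring case are assumptions made for this sketch.
def searchable_institution?(institution)
  name = institution.to_s.strip
  !name.empty? && !name.casecmp?('all')
end

searchable_institution?('Stanford University') # => true
searchable_institution?('all')                 # => false
```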

Does a normalized institution name match an email domain?

If the institution normalized form works better, is that because it can be matched against an email domain? See #285 and the notes below on using acronyms (which are often used in email/web domains).
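One way to test this hypothesis locally is to check whether a normalized, single-word institution name appears as a label in an email domain. This is a sketch only: `institution_in_email_domain?` is a hypothetical helper, and it says nothing about how the SW service itself matches email addresses.

```ruby
# Hypothetical check: does the normalized institution name appear as
# one of the dot-separated labels of the email's domain?
def institution_in_email_domain?(normalized_institution, email)
  name = normalized_institution.to_s.strip.downcase
  domain = email.to_s.split('@', 2)[1]
  return false if name.empty? || domain.nil?

  domain.downcase.split('.').include?(name)
end

institution_in_email_domain?('stanford', 'jdoe@cs.stanford.edu') # => true
institution_in_email_domain?('stanford', 'jdoe@example.org')     # => false
```

A multi-word normalized name will never match a domain label this way, which is one reason an acronym may be the better search term.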

There is an interesting note about email data in the SW service: in the smart search notes in the SW-API docs (p. 8), it states that Medline and ScienceWire began to collect email addresses in 1996 and 1997, respectively. This is a clue to the effectiveness of emails in the smart search and possibly a reason why the smart search is not currently used without any seed publications, even if it could be used with only an email address. This is also a clue to how effective email could be for the PublicationQuery search.

Should the institution be normalized to an acronym?

With regard to controlled vocabulary, there are helpful comments from Grace (April 4, 2016) noted in consul, i.e.:

  • "It is my understanding that ScienceWire is only searching the address field, not the Organization-Enhanced field (which I tend to use when I am troubleshooting because it is supposed to cluster variants together). Please note that institute/dept names are all abbreviated so it is safest to search only Stanford. Looks like to be comprehensive we would need to search Stanford as an address plus Stanford University in the Organizations-Enhanced index to retrieve citations where Stanford wasn't used, only the name of the institute or place instead (e.g. SUMC for Stanford University Medical School)."
  • "institute/dept names are all abbreviated"
    • SUMC for Stanford University Medical School

This is consistent with some test queries; we found that UCSB returned more accurate results than University of California, Santa Barbara. We also found that ucsb is as effective as UCSB.
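If acronyms are the better search term, a first approximation can be derived from the full name by taking the initial of each significant word. This is a sketch under that assumption; `institution_acronym` and its stop-word list are hypothetical, and abbreviations used in the data (such as SUMC above) do not always follow word initials, so a lookup table may still be needed.

```ruby
# Words that do not contribute a letter to the acronym (an assumption).
ACRONYM_STOP_WORDS = %w[of the and].freeze

# Hypothetical helper: build an acronym from the initials of the
# significant words in an institution name.
def institution_acronym(name)
  name.to_s
      .scan(/[[:alpha:]]+/)
      .reject { |word| ACRONYM_STOP_WORDS.include?(word.downcase) }
      .map { |word| word[0].upcase }
      .join
end

institution_acronym('University of California, Santa Barbara')
# => "UCSB"
```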

SWOT university authority project

We also found a project doing some university authority work at

The school names feature looks useful, e.g.

Swot::school_name '[email protected]'
# => "University of Strathclyde"
Swot::school_name 'http://www.stanford.edu'
# => "Stanford University"

Hypothetically, if we have an alt-name email address, we could use swot to get the name of the institution (or the CAP UI could do this).
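A rough sketch of that idea is below. It assumes the swot gem is installed and that `Swot.school_name` accepts an email address as in the example above; `institution_for_email` is a hypothetical wrapper, and the `lookup:` parameter is there only so the lookup can be swapped out (for example in tests, or if a different authority source is used).

```ruby
begin
  require 'swot' # assumes the swot gem is installed
rescue LoadError
  # Without the gem, callers must pass their own lookup callable.
end

# Hypothetical wrapper: resolve an institution name from an email
# address, returning nil when the value does not look like an email.
def institution_for_email(email, lookup: nil)
  address = email.to_s.strip
  return nil unless address.include?('@')

  lookup ||= Swot.method(:school_name)
  lookup.call(address)
end
```

The result could then be passed through normalize_institution before it is used as a search term.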
