WidCode problems after Python 3 migration #83

icemac · 2019-10-30T10:11:58Z

I migrated a ZODB for a customer using zodbupdate to Python 3. Now I get the following error when searching for a non-ASCII character in a ZCTextIndex.

...
  File ".../Products.ZCatalog-5.0.1-py3.7.egg/Products/ZCatalog/ZCatalog.py", line 611, in searchResults
    return self._catalog.searchResults(query, **kw)
  File ".../Products.ZCatalog-5.0.1-py3.7.egg/Products/ZCatalog/Catalog.py", line 1091, in searchResults
    return self.search(query, sort_indexes, reverse, sort_limit, _merge)
  File ".../Products.ZCatalog-5.0.1-py3.7.egg/Products/ZCatalog/Catalog.py", line 634, in search
    rs = self._search_index(cr, index_id, query, rs)
  File ".../Products.ZCatalog-5.0.1-py3.7.egg/Products/ZCatalog/Catalog.py", line 564, in _search_index
    index_rs = index.query_index(index_query, rs)
  File ".../Products.ZCatalog-5.0.1-py3.7.egg/Products/ZCTextIndex/ZCTextIndex.py", line 210, in query_index
    results = tree.executeQuery(self.index)
  File ".../Products.ZCatalog-5.0.1-py3.7.egg/Products/ZCTextIndex/ParseTree.py", line 132, in executeQuery
    return index.search_phrase(self.getValue())
  File ".../Products.ZCatalog-5.0.1-py3.7.egg/Products/ZCTextIndex/BaseIndex.py", line 218, in search_phrase
    if docwords.find(code) >= 0:
TypeError: argument should be integer or bytes-like object, not 'str'

query = {'SearchableMetaData': 'Thüringen-Kliniken', ...}
docwords = b'\x92k+$\xfeQO\'\xfeQ`\x06\xfeQP%\xfeR\x05\x0f\xfeQd1\xfeQOL\xfeQ]\x01\xfeQ]\x02\xdfff&\xfeQR\x0b\xfeQYd\xb5\n\x1a"\xfeTq7\xfeQO\'\xfeQ\\g\xfeQo\x7f\xfeQNt\xfeQU\x1d\xda\'VJ\xa9a\x7f%\xfeQPV\xa50VS\xfeQ]Q'
code = 'þQR\x0bþQ`\x06'

self._docwords contains a mixture of bytes and str objects. The str values are the empty ones.
Re-indexing did not solve the problem.

What is the desired datatype for the docwords: str or bytes? According to WidCode.encode() it seems to be str.
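For reference, the type mismatch can be reproduced outside of Zope. This is only a minimal sketch: the `docwords` value below is a shortened, made-up stand-in (not the customer data above), and encoding the search code with latin-1 is an assumption used to show that the two values differ in type only, not in content.

```python
# Shortened, hypothetical stand-in for a stored docwords value (bytes).
docwords = b'\xfeQR\x0b\xfeQ`\x06'
# The search code as produced on Python 3 (str), same value as in the report.
code = 'þQR\x0bþQ`\x06'

# bytes.find() refuses a str argument; this is the TypeError from the traceback.
try:
    docwords.find(code)
    error = None
except TypeError as exc:
    error = str(exc)

# latin-1 maps code points 0..255 to the identical byte values, so the
# encoded search code matches the stored bytes exactly.
found_at = docwords.find(code.encode('latin-1'))
```

With both sides as bytes (or both as str) the lookup works again, which is why the question of the intended datatype matters.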


icemac · 2019-10-30T10:54:55Z

@davisagli You worked on WidCode the last time it was changed. Do you have any idea?

My current plan is to iterate over _docwords.values() and convert each value to str using value.decode('latin1'). Does this seem reasonable?
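A rough sketch of that plan, assuming `_docwords` behaves like a mapping from document id to encoded value. The helper name `normalize_docwords` is made up for illustration, and a plain dict stands in for the persistent mapping. On a real index the change would also have to be committed in a transaction.

```python
def normalize_docwords(docwords):
    """Decode every bytes value to str in place; return how many were converted.

    `docwords` is assumed to be a mutable mapping of docid -> encoded words.
    """
    converted = 0
    # Copy the items first so the mapping is not mutated while iterating.
    for docid, value in list(docwords.items()):
        if isinstance(value, bytes):
            # latin-1 maps each byte to the code point with the same number,
            # so no information is lost and nothing can fail to decode.
            docwords[docid] = value.decode('latin-1')
            converted += 1
    return converted


# Stand-in data: one migrated bytes value and one empty str value.
sample = {1: b'\xfeQR\x0b', 2: ''}
changed = normalize_docwords(sample)
```

Values that are already str (the empty ones mentioned above) are left untouched, so the helper can be run more than once.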

icemac · 2019-10-30T15:01:03Z

My suggestion in the previous comment at least solved the problem.

d-maurer · 2019-10-31T06:08:19Z

Michael Howitz wrote at 2019-10-30 03:12 -0700:

> I migrated a ZODB for a customer using `zodbupdate` to Python 3. Now I get the following error when searching for a non-ASCII character in a `ZCTextIndex`. ... `self._docwords` contains a mixture of `byte` and `str` objects. The `str` ones are the empty ones. Re-indexing the index did not help to solve the problem. What is the desired datatype for the docwords? `str` or `bytes`. According to `WidCode.encode()` it seems to be `str`.

"docwords" is used to implement phrase searches, i.e. searches where a document must contain a given sequence of words. Conceptually, a "docwords" is a sequence of integers (where each integer is the index of a word in the lexicon) - representing the sequence of words in the document. You can then check whether a document contains a given phrase by converting the phrase into a sequence of word integers and then checking whether "docwords" contains this sequence.

Initially (when Python did not yet have a unicode datatype), those integers were interpreted as unicode code points (this gives a unicode string) and then utf-8 encoded (this gives a byte string); this is called "wid-encoding".

Nowadays, Python supports unicode directly; we could avoid the "utf-8" encoding. *BUT* this might increase memory consumption, as the "utf-8" encoding may be more compact than the direct unicode representation. In addition: with Python 2, there has been a compilation option telling Python whether to use a 2 or a 4 byte unicode representation (I do not know whether this option still exists for Python 3). 2-byte Unicode may not be sufficient to represent the widcodes for all practical "docwords".

Personally, I would keep the "utf-8" encoding; a "docwords" would then be a byte string - representing a utf-8 encoded sequence of integers (each integer interpreted as a unicode code point).
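The idea described above can be sketched in a few lines. This illustrates the concept only and is not the actual `WidCode` implementation. The function names `encode_wids` and `contains_phrase` are made up here, and the sketch ignores word ids that fall into the surrogate range (0xD800 to 0xDFFF), which plain utf-8 cannot encode.

```python
def encode_wids(wids):
    """Treat each word id as a code point and utf-8 encode the result."""
    return ''.join(chr(wid) for wid in wids).encode('utf-8')


def contains_phrase(docwords, phrase_wids):
    """Return True if the document's word sequence contains the phrase.

    Both sides are bytes, so bytes.find() works. utf-8 is self-synchronizing,
    so a byte-level match always lines up with whole word ids.
    """
    return docwords.find(encode_wids(phrase_wids)) >= 0


# A document consisting of four words, given by their lexicon ids.
doc = encode_wids([5, 300, 70000, 12])
hit = contains_phrase(doc, [300, 70000])   # words adjacent, in this order
miss = contains_phrase(doc, [70000, 300])  # same words, wrong order
```

Small word ids take a single byte and larger ones up to four, which is the compactness argument made above. On the 2 versus 4 byte question: as far as I know, that build option was removed in Python 3.3 (PEP 393), and `str` now picks its internal width per string.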

icemac added the bug label Oct 30, 2019

icemac self-assigned this Oct 30, 2019

icemac mentioned this issue Oct 7, 2020

ZCTextIndex uses binary data for WID #109

Open
