WidCode problems after Python 3 migration #83
@davisagli You worked on WidCode the last time it was changed. Do you have any idea? My current plan is to iterate over |
My suggestion in the previous comment at least solved the problem. |
Michael Howitz wrote at 2019-10-30 03:12 -0700:
I migrated a ZODB for a customer using `zodbupdate` to Python 3. Now I get the following error when searching for a non-ASCII character in a `ZCTextIndex`.
...
`self._docwords` contains a mixture of `bytes` and `str` objects. The `str` ones are the empty ones.
Re-indexing the index did not help to solve the problem.
What is the desired datatype for the docwords: `str` or `bytes`? According to `WidCode.encode()` it seems to be `str`.
"docwords" is used to implement phrase searches, i.e. searches
where a document must contain a given sequence of words.
Conceptually, a "docwords" is a sequence of integers (where each integer
is the index of a word in the lexicon) - representing the sequence
of words in the document.
You can then check whether a document contains a given phrase
by converting the phrase into a sequence of word integers and
then checking whether "docwords" contains this sequence.
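
As a rough illustration of the phrase check described above, here is a minimal sketch on plain integer lists. It is not the actual ZCTextIndex code; the `lexicon` dict and the `contains_phrase` helper are hypothetical names used only for this example.

```python
def contains_phrase(docwords, phrase):
    """Return True if the word ids in `phrase` occur contiguously in `docwords`."""
    n = len(phrase)
    if n == 0:
        return True
    # Slide a window of length n over the document's word ids.
    return any(docwords[i:i + n] == phrase
               for i in range(len(docwords) - n + 1))


# Hypothetical lexicon: word -> integer word id.
lexicon = {"the": 1, "quick": 2, "brown": 3, "fox": 4}

# "docwords" for one document: the word ids in document order.
doc = [lexicon[w] for w in "the quick brown fox".split()]

print(contains_phrase(doc, [lexicon["quick"], lexicon["brown"]]))
print(contains_phrase(doc, [lexicon["brown"], lexicon["quick"]]))
```

The first call finds the phrase; the second does not, because the order of the word ids matters.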
Initially (when Python did not yet have a unicode datatype), those
integers were interpreted as unicode code points (this gives a
unicode string) and then utf-8 encoded (this gives a byte string);
this is called "wid-encoding".
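
The encoding as described here can be sketched in Python 3 roughly as follows. This follows only the description in this comment; the real `WidCode` module may use a different byte layout, and `encode_wids` / `decode_wids` are hypothetical names. Note the limits of this sketch: `chr()` stops at 0x10FFFF, and Python 3's strict UTF-8 codec rejects surrogate code points (0xD800 to 0xDFFF), so word ids in those ranges would fail.

```python
def encode_wids(wids):
    """Interpret each word id as a code point, then UTF-8 encode the result."""
    return "".join(chr(w) for w in wids).encode("utf-8")


def decode_wids(data):
    """Inverse of encode_wids: UTF-8 decode, then read back the code points."""
    return [ord(c) for c in data.decode("utf-8")]


doc = encode_wids([1, 2, 300, 4])
phrase = encode_wids([2, 300])

# UTF-8 is self-synchronizing, so a byte-level substring match of one
# valid encoding inside another always falls on code point boundaries.
print(phrase in doc)
print(decode_wids(doc))
```

With this representation the phrase search becomes a plain `bytes` substring test.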
Nowadays, Python supports unicode directly; we could avoid
the "utf-8" encoding. *BUT* this might increase memory consumption
as the "utf-8" encoding may be more compact than the direct
unicode representation.
In addition: with Python 2, there has been a compilation
option telling Python whether to use a 2 or a 4 byte unicode representation
(I do not know whether this option still exists for Python 3).
2-byte Unicode may not be sufficient to represent the widcodes for
all practical "docwords".
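
For a rough sense of the size difference: since Python 3.3 (PEP 393) the 2-byte/4-byte build option no longer exists, and CPython stores each string with 1, 2, or 4 bytes per character depending on its widest code point. The sketch below compares both representations for a made-up list of word ids; the sample data is an assumption, and the `sys.getsizeof` figures are CPython-specific.

```python
import sys

# Made-up word ids: mostly small, plus one above 0xFFFF, which makes
# CPython store the whole str with 4 bytes per character.
wids = list(range(1, 1000)) + [70000]

text = "".join(chr(w) for w in wids)   # direct unicode representation
encoded = text.encode("utf-8")         # "wid-encoded" byte string

# UTF-8 payload: 127 ids at 1 byte + 872 ids at 2 bytes + 1 id at 4 bytes.
utf8_len = len(encoded)

print("code points:       ", len(text))
print("utf-8 bytes:       ", utf8_len)
print("getsizeof(str):    ", sys.getsizeof(text))
print("getsizeof(bytes):  ", sys.getsizeof(encoded))
```

A single large word id is enough to widen the whole `str`, while the UTF-8 form only pays for the wide code point itself.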
Personally, I would keep the "utf-8" encoding and then a
"docwords" would be a byte string - representing a utf-8
encoded sequence of integers (each integer interpreted as a unicode
code point).
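
If "docwords" values are to be byte strings, the mixed state reported in the issue (empty values being `str`) could be cleaned up along these lines. This is only a sketch under assumptions: `normalize_empty_docwords` is a hypothetical helper, it works on any mapping with `items()` and item assignment, and it converts only empty `str` values, since those are the only `str` values the report mentions. Non-empty `str` values are left untouched and reported for manual inspection.

```python
def normalize_empty_docwords(docwords):
    """Replace empty str values with b"" in place.

    Returns the keys whose values are non-empty str; those are not
    converted here because the right conversion is not known.
    """
    suspicious = []
    # Copy the items first so the mapping is not mutated while iterating.
    for docid, value in list(docwords.items()):
        if isinstance(value, str):
            if value == "":
                docwords[docid] = b""
            else:
                suspicious.append(docid)
    return suspicious


# Stand-in for the index's mapping; a plain dict is used for illustration.
docwords = {1: b"\x01\x02", 2: "", 3: b"", 4: ""}
left_over = normalize_empty_docwords(docwords)
print(docwords, left_over)
```

Against a real database this would need to run inside a transaction and be tested on a copy first.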