Problem with some Unicode chars #4

vityok · 2013-01-04T11:42:00Z

It looks like Wilbur has a problem with certain Unicode chars in certain circumstances.

Code to reproduce:

Download the RDF/XML data from DBpedia:

wget http://dbpedia.org/data/Semantic_Web.rdf

Parse it with the external format explicitly specified:

(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input
             :external-format :utf-8))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

This produces an error on both CCL and SBCL:

> Error: Cannot decode this: (#\U+30BB #\U+30DE #\U+30F3 #\U+30C6 #\U+30A3 #\U+30C3 #\U+30AF #\U+30FB #\U+30A6 #\U+30A7 #\U+30D6)
> While executing: (:INTERNAL WILBUR::COLLAPSE WILBUR:COLLAPSE-WHITESPACE), in process listener(1).

debugger invoked on a SIMPLE-ERROR in thread
#<THREAD "main thread" RUNNING {AB2F861}>:
  Cannot decode this: (#\HANGUL_SYLLABLE_U #\HANGUL_SYLLABLE_KEU
                       #\HANGUL_SYLLABLE_RA #\HANGUL_SYLLABLE_I
                       #\HANGUL_SYLLABLE_NA)
(WILBUR:COLLAPSE-WHITESPACE "우크라이나")

But everything works fine if the external format is not specified:

(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

Produces:

#<TEMPORARY-PARSER-DB size 157 #x1862A5C6>

The resulting database can then be queried successfully.

The problem is even more evident when using flexi-streams.

lisp · 2013-01-04T12:44:04Z

when parsed without explicit encoding, are the string literals in the store then correct, or is the text corrupt?

vityok · 2013-01-04T13:29:14Z

Yep, the interesting thing is that without an explicit encoding it works just fine. The problem happens when the encoding is specified. Specifying the encoding can be avoided when reading from a file, but it must be specified in order to parse the byte array returned by Drakma's http-request with Flexi-streams.

vityok · 2013-01-07T18:24:56Z

The problematic code is in the xml-util.lisp file and is used only in one place: xml-parser.lisp.

The function (collapse-whitespace) performs a kind of string trimming:

(collapse-whitespace "    a b c ")   => "a b c"

(collapse-whitespace "aaaa    a b c ") => "aaaa a b c"

P.S. It looks like the bit-wise operations are meant to detect the kind of Unicode character/character group, i.e. see here.
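For illustration, this is the kind of bit test typically used to classify the first octet of a UTF-8 sequence. It is only a sketch of the general technique, not Wilbur's actual code; the name utf8-sequence-length is hypothetical, and the function assumes its argument is a single octet (0-255), not a decoded character code.

```lisp
;; Hypothetical helper, not taken from Wilbur: classify a UTF-8 lead octet
;; by its high bits.
(defun utf8-sequence-length (octet)
  "Return the number of octets in the UTF-8 sequence that OCTET starts,
or NIL if OCTET is a continuation octet or not a valid lead octet."
  (check-type octet (unsigned-byte 8))
  (cond ((= (logand octet #b10000000) #b00000000) 1)  ; 0xxxxxxx: ASCII
        ((= (logand octet #b11100000) #b11000000) 2)  ; 110xxxxx
        ((= (logand octet #b11110000) #b11100000) 3)  ; 1110xxxx
        ((= (logand octet #b11111000) #b11110000) 4)  ; 11110xxx
        (t nil)))                                     ; 10xxxxxx or invalid
```

For example, (utf8-sequence-length #xE3) returns 3; #xE3 is the first octet of the UTF-8 encoding of U+30BB (E3 82 BB). Tests like this only make sense on raw octets. If the stream has already decoded the text into full characters, there is nothing left for such code to decode, which might be related to the "Cannot decode this" error above.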

I think that this code was written before the primary Lisp implementations supported Unicode/UTF, and therefore Ora had to employ these binary tricks. It can be written much more easily now...
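As a rough idea of what a simpler version could look like, here is a minimal sketch that works on already-decoded characters and never inspects bit patterns. It is not Wilbur's code or a proposed patch; the names xml-whitespace-p and collapse-whitespace-sketch are hypothetical, and it assumes that only space, tab, line feed and carriage return count as whitespace.

```lisp
;; Hypothetical sketch, not Wilbur's implementation.
(defun xml-whitespace-p (char)
  "True if CHAR is a space, tab, line feed or carriage return."
  (member (char-code char) '(32 9 10 13)))

(defun collapse-whitespace-sketch (string)
  "Drop leading and trailing whitespace from STRING and squeeze every
inner run of whitespace into a single space."
  (with-output-to-string (out)
    (let ((pending-space nil)  ; whitespace seen since the last written char
          (seen-text nil))     ; at least one non-whitespace char written
      (loop for char across string
            do (cond ((xml-whitespace-p char)
                      ;; Leading whitespace is ignored; later runs are
                      ;; remembered and emitted lazily as one space.
                      (when seen-text
                        (setf pending-space t)))
                     (t
                      (when pending-space
                        (write-char #\Space out)
                        (setf pending-space nil))
                      (write-char char out)
                      (setf seen-text t)))))))
```

With this sketch, (collapse-whitespace-sketch "    a b c ") returns "a b c" and (collapse-whitespace-sketch "aaaa    a b c ") returns "aaaa a b c", matching the examples above. Non-ASCII characters are simply copied through.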

arademaker · 2013-01-10T20:08:22Z

I had the same experience as @vityok.

arademaker · 2013-01-10T20:19:17Z

I suspect that Wilbur's whole external interface is weak and based on old limitations of libraries and Lisp implementations. Rather than trying to fix Wilbur's Unicode support, we should try to replace Wilbur's parser with the cl-rdfxml parser.

Something like the code below worked for me. I still have to improve it a lot and handle the blank-node instances created by cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).

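;; Convert a PURI URI object into a Wilbur node; pass anything else
;; through unchanged.
(defun puri-to-node (s)
  (if (typep s 'puri:uri)
      (w:node (puri:render-uri s nil))
      s))

(setf w:*db* (make-instance 'wilbur:db))

;; Parse the RDF/XML at PATH, adding each emitted triple to w:*db*.
(defun parse-rdfxml (path)
  (cl-rdfxml:parse-document
   (lambda (a b c)
     (w:add-triple (w:triple (puri-to-node a) (puri-to-node b) (puri-to-node c))))
   path))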

What do you think? Is that a good direction? Of course we will add dependencies to Wilbur, but that, in my opinion, is good and follows the recent suggestion at http://fare.livejournal.com/169346.html

lisp · 2013-01-10T21:48:51Z

good evening, alex;

On 2013-01-10, at 21:19, Alexandre Rademaker wrote:

> I suspect that if we consider that the whole Wilbur's external
> interface is weak and based on old limitations of libraries and
> lisp implementations, better than trying to fix the Wilbur's
> unicode support, we should try to replace the wilbur's parse with
> cl-rdfxml parse.
>
> Something like the code below worked for me. I still have to
> improve it a lot and handle the blank-nodes instances created by
> cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).
>
> (defun puri-to-node (s)
>   (if (typep s 'puri:uri)
>       (w:node (puri:render-uri s nil))
>       s))
>
> (setf w:*db* (make-instance 'wilbur:db))
>
> (defun parse-rdfxml (path)
>   (cl-rdfxml:parse-document
>    (lambda (a b c)
>      (w:add-triple (w:triple (puri-to-node a) (puri-to-node b) (puri-to-node c))))
>    path))
>
> What you think? Is that a good direction? Of course we will add
> dependences to Wilbur but that, in my opinion, is good and follow
> recently suggestion http://fare.livejournal.com/169346.html

yes, in general fare is correct. the problem is, it is not always
clear which library is best.
i had tried to convince ora - way back then, that it would have been
better to use a common library, but he was not convinced.
i would suggest a different xml library to you, but it also has
dependencies and if cl-rdfxml actually supports the standard and
yields a coherent object model, then it would certainly be worth a
try. the minimum would be, that it use the current network libraries,
has portable or runtime unicode support, and permits to parse
straight to an rdf model without an intermediate dom.

what else?

vityok · 2013-01-11T16:42:54Z

Currently Wilbur works with in-memory RDF databases, but I've found that there are already efforts to create a persistence layer for Wilbur (see Wiki) and there is de.setf.resource that offers some kind of persistence for RDF classes. I guess that there are other Wilbur or RDF-related persistence and query-processing projects that can be found even on GitHub (and probably there are more in the rest of the WWW).

I guess that it would be very nice to bring some of them together to make a feature-rich RDF storage/processing engine.

P.S. Here is, for example, Twinql, a SPARQL engine built on top of Wilbur. But the project is not actively developed (according to the description), and it would be very unfortunate if it remained so...

arademaker · 2018-11-25T23:32:26Z

Can we have a solution for this issue? Actually, for me it doesn't work with or without the :external-format :utf-8.

arademaker · 2018-11-25T23:59:31Z

Sorry @vityok, I just saw your PR #5 from 5 years ago. It looks like this repo is abandoned, so I will fork it. But how do we get Quicklisp updated? I opened an issue at quicklisp/quicklisp-projects#1593

vityok mentioned this issue Jan 7, 2013

replace naive unicode support with system-dependent in collapse-whitespa... #5

Open
