Problem with some Unicode chars #4

Open
vityok opened this issue Jan 4, 2013 · 9 comments

Comments

@vityok
Contributor

vityok commented Jan 4, 2013

It looks like Wilbur has a problem with certain Unicode chars in certain circumstances.

Code to reproduce:

  1. download RDF/XML data from DBpedia:
wget http://dbpedia.org/data/Semantic_Web.rdf
  2. parse with the external format explicitly defined:
(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input
             :external-format :utf-8))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

This produces an error on both CCL and SBCL:

> Error: Cannot decode this: (#\U+30BB #\U+30DE #\U+30F3 #\U+30C6 #\U+30A3 #\U+30C3 #\U+30AF #\U+30FB #\U+30A6 #\U+30A7 #\U+30D6)
> While executing: (:INTERNAL WILBUR::COLLAPSE WILBUR:COLLAPSE-WHITESPACE), in process listener(1).
debugger invoked on a SIMPLE-ERROR in thread
#<THREAD "main thread" RUNNING {AB2F861}>:
  Cannot decode this: (#\HANGUL_SYLLABLE_U #\HANGUL_SYLLABLE_KEU
                       #\HANGUL_SYLLABLE_RA #\HANGUL_SYLLABLE_I
                       #\HANGUL_SYLLABLE_NA)
(WILBUR:COLLAPSE-WHITESPACE "우크라이나")

But everything works fine if the external format is not specified:

(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

Produces:

#<TEMPORARY-PARSER-DB size 157 #x1862A5C6>

That database can then be queried successfully.

The problem is even more evident when using flexi-streams.

@lisp
Owner

lisp commented Jan 4, 2013

when parsed without explicit encoding, are the string literals in the store then correct, or is the text corrupt?

@vityok
Contributor Author

vityok commented Jan 4, 2013

Yep, the interesting thing is that without an explicit encoding it works just fine. The problem happens only when the encoding is specified. Specifying the encoding can be avoided when reading from a file, but it must be specified in order to parse the byte array returned by Drakma's http-request with flexi-streams.

@vityok
Contributor Author

vityok commented Jan 7, 2013

The problematic code is in the xml-util.lisp file and is used only in one place: xml-parser.lisp.

The function (collapse-whitespace) performs a kind of string trimming:

(collapse-whitespace "    a b c ")   => "a b c"

(collapse-whitespace "aaaa    a b c ") => "aaaa a b c"

P.S. It looks like the bit-wise operations are meant to detect the kind of Unicode character or character group. For example, see here.

I think that this code was written before the primary Lisp implementations supported Unicode/UTF, and therefore Ora had to employ these binary tricks. It can be written much more simply now...
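
To illustrate what a simpler version might look like, here is a minimal sketch that reproduces the two examples above using ordinary character handling. It is not Wilbur's actual code: the names whitespace-char-p and collapse-whitespace-sketch are hypothetical, and it assumes that only space, tab, line feed and carriage return count as whitespace.

```lisp
;; Hypothetical sketch, not the code from xml-util.lisp.
;; Assumption: whitespace means space, tab, LF and CR only.
(defun whitespace-char-p (char)
  "True if CHAR is a space, tab, line feed or carriage return."
  (member (char-code char) '(32 9 10 13)))

(defun collapse-whitespace-sketch (string)
  "Drop leading and trailing whitespace in STRING and squeeze each
inner run of whitespace into a single space."
  (with-output-to-string (out)
    (let ((pending-space nil)   ; a whitespace run was seen after some text
          (seen-text nil))      ; at least one non-whitespace char was written
      (loop for char across string
            do (cond ((whitespace-char-p char)
                      ;; Remember the run, but emit nothing yet, so that
                      ;; trailing whitespace is never written.
                      (when seen-text
                        (setf pending-space t)))
                     (t
                      (when pending-space
                        (write-char #\Space out)
                        (setf pending-space nil))
                      (write-char char out)
                      (setf seen-text t)))))))
```

Because the sketch never inspects the bits of a character code, a string such as "우크라이나" passes through unchanged instead of signalling "Cannot decode this".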

@arademaker

I had the same experience as @vityok.

@arademaker

I suspect that Wilbur's whole external interface is weak and based on old limitations of libraries and Lisp implementations. If so, rather than trying to fix Wilbur's Unicode support, we should try to replace Wilbur's parser with the cl-rdfxml parser.

Something like the code below worked for me. I still have to improve it a lot and handle the blank-node instances created by cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).

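;; Convert a PURI URI object into a Wilbur node; pass anything else through.
(defun puri-to-node (s)
  (if (typep s 'puri:uri)
      (w:node (puri:render-uri s nil))
      s))

(setf w:*db* (make-instance 'wilbur:db))

;; Parse the RDF/XML document at PATH with cl-rdfxml and add each
;; triple to w:*db*. Blank nodes are not handled yet.
(defun parse-rdfxml (path)
  (cl-rdfxml:parse-document
   (lambda (a b c)
     (w:add-triple (w:triple (puri-to-node a)
                             (puri-to-node b)
                             (puri-to-node c))))
   path))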

What do you think? Is that a good direction? Of course we will add dependencies to Wilbur, but that, in my opinion, is good and follows the recent suggestion at http://fare.livejournal.com/169346.html

@lisp
Owner

lisp commented Jan 10, 2013

good evening, alex;

On 2013-01-10, at 21:19 , Alexandre Rademaker wrote:

> I suspect that if we consider that the whole Wilbur's external
> interface is weak and based on old limitations of libraries and
> lisp implementations, better than trying to fix the Wilbur's
> unicode support, we should try to replace the wilbur's parse with
> cl-rdfxml parse.
>
> Something like the code below worked for me. I still have to
> improve it a lot and handle the blank-nodes instances created by
> cl-rdfxml (http://www.cs.rpi.edu/~tayloj/CL-RDFXML/#blank_nodes).
>
> (defun puri-to-node (s)
>   (if (eq (type-of s) 'puri:uri)
>       (w:node (puri:render-uri s nil))
>       s))
>
> (setf w:*db* (make-instance 'wilbur:db))
>
> (defun parse-rdfxml (path)
>   (cl-rdfxml:parse-document
>    (lambda (a b c)
>      (w:add-triple (w:triple (puri-to-node a) (puri-to-node b) (puri-to-node c))))
>    path))
>
> What you think? Is that a good direction? Of course we will add
> dependences to Wilbur but that, in my opinion, is good and follow
> recently suggestion http://fare.livejournal.com/169346.html

yes, in general fare is correct. the problem is, it is not always
clear which library is best.
i had tried to convince ora - way back then, that it would have been
better to use a common library, but he was not convinced.
i would suggest a different xml library to you, but it also has
dependencies and if cl-rdfxml actually supports the standard and
yields a coherent object model, then it would certainly be worth a
try. the minimum would be that it uses the current network libraries,
has portable or runtime unicode support, and permits parsing
straight to an rdf model without an intermediate dom.

what else?



@vityok
Contributor Author

vityok commented Jan 11, 2013

Currently Wilbur works with in-memory RDF databases, but I've found that there are already efforts to create a persistence layer for Wilbur (see Wiki) and there is de.setf.resource that offers some kind of persistence for RDF classes. I guess that there are other Wilbur or RDF-related persistence and query-processing projects that can be found even on GitHub (and probably there are more in the rest of the WWW).

I guess that it would be very nice to bring some of them together to make a feature-rich RDF storage/processing engine.

P.S. Here, for example, is Twinql, a SPARQL engine built on top of Wilbur. But the project is not actively developed (according to its description), and it would be very unfortunate if it remained so...

@arademaker

arademaker commented Nov 25, 2018

Can we have a solution for this issue? Actually, for me it doesn't work with or without the :external-format :utf-8.

@arademaker

Sorry @vityok, I just saw your PR #5 from 5 years ago. It looks like this repo is abandoned, so I will fork it. But how do we get Quicklisp updated? I opened an issue at quicklisp/quicklisp-projects#1593
