UTF-8 encoded values lose their encoding on zk.get() #81

Open
jperville opened this issue Jun 4, 2014 · 2 comments

@jperville

Introduction

As I understand it, zookeeper stores data as arrays of bytes internally, which should make it encoding-agnostic. However, when I use a zk client library, I expect the data I store in zookeeper to come back with a consistent encoding every time.

The problem

UTF-8 encoded strings stored in zookeeper with zk.create come back as ASCII-8BIT encoded strings (containing the raw UTF-8 bytes) when retrieved with zk.get. This happens for strings containing only ASCII characters (where it is not too bad) and also for strings with characters outside the ASCII range (where it is more problematic, because calling encode('UTF-8') on those strings raises Encoding::UndefinedConversionError instead of converting them back).

Workaround: in application code, force the encoding after retrieving the data (e.g. data.force_encoding('UTF-8')).
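
A minimal sketch of that workaround (assuming a connected ZK client zk, and that everything stored under the paths being read back was written as UTF-8):

# hypothetical helper, not part of the zk API: wraps zk.get and restores
# the UTF-8 encoding on the returned data before handing it back
def get_utf8(zk, path)
  data, stat = zk.get(path)
  [data.force_encoding('UTF-8'), stat]
end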

PS: Using ruby 1.9.3 and 2.1 on Ubuntu 14.04 LTS (amd64).

How to reproduce

Save and run the following script (zk must be installed or part of the current bundle):

#!/usr/bin/env ruby1.9.1
# -*- encoding: utf-8 -*-

require 'zk'

def encoding_bug(zk, val, path='/testme-encoding')
  puts "* we would expect the original value and its copy retrieved from zk to be the same"
  puts "* however the retrieved value lost has its original encoding and must be force-encoded"
  puts "* to be usable with eg. JSON.encode() which cast non-UTF-8 strings to UTF-8."

  puts "original value #{val.inspect}, with encoding: " + val.encoding.inspect
  zk.create(path, val)
  begin
    val2 = zk.get(path).first
    puts "retrieved value #{val2.inspect}, with encoding: " + val2.encoding.inspect
    print "attempting to encode val2 to UTF-8, its real encoding => "
    begin
      val2.encode('UTF-8')
      raise "should be failing with Encoding::UndefinedConversionError!"
    rescue Encoding::UndefinedConversionError => e
      puts "as expected, raises " + e.inspect
    end
    print "attempting to force encoding to 'UTF-8', its original encoding => "
    begin
      val2.force_encoding('UTF-8')
      puts "succeeds, val2 is now " + val2.inspect
    rescue => e
      puts "encountered an unexpected exception: " + e.inspect
    end
  ensure
    zk.delete(path)
  end
end

uri = ARGV.first || 'localhost:2181'
encoding_bug(ZK.new(uri), 'é')
@slyphon
Contributor

slyphon commented Jun 7, 2014

"I expect the data I store in zookeeper to have consistent encoding
all the time."

Apparently it's ASCII-8BIT :)

All kidding aside, I never stored anything but coordination data in
zookeeper (and IMHO if you're using zk as a database, you're doing it
wrong, but that's immaterial) so I've never paid close attention to
encoding issues.

zk cannot make assumptions about the type of data people are storing.
One place was serializing structs into binary and storing that. How
would ZK handle that use case if it mandated UTF-8 data?
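
For instance, a quick hypothetical sketch: the output of Marshal.dump is arbitrary bytes (tagged ASCII-8BIT) that may not be valid UTF-8 at all, so the library cannot safely tag everything it reads as UTF-8:

require 'zk'

Point = Struct.new(:x, :y)
blob = Marshal.dump(Point.new(1, 2))   # serialized struct, tagged ASCII-8BIT
puts blob.encoding                     # => ASCII-8BIT

zk = ZK.new(ARGV.first || 'localhost:2181')
zk.create('/testme-binary', blob)
data, _stat = zk.get('/testme-binary')
# round-trips byte-for-byte; interpreting these bytes as UTF-8 would be wrong,
# which is why the returned string is left as ASCII-8BIT
puts Marshal.load(data).inspect        # => #<struct Point x=1, y=2>
zk.delete('/testme-binary')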

@jperville
Author

Actually, we use zk as an index that stores the locations of mirrors from which the content of a resource can be fetched (and those locations may change all the time). The problem on our side was that we didn't validate our input data enough, and ended up with some URIs containing characters that should have been escaped.
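
For what it's worth, a rough sketch of the kind of escaping that would have avoided the problem on our side (the URI and path below are made up; URI::DEFAULT_PARSER.escape percent-encodes the non-ASCII characters so the stored value is plain ASCII):

require 'uri'
require 'zk'

raw  = 'http://mirror.example.com/files/é.tar.gz'    # made-up mirror URI
safe = URI::DEFAULT_PARSER.escape(raw)               # => ".../files/%C3%A9.tar.gz"

# safe contains only ASCII characters, so it survives the zk.create /
# zk.get round trip without any encoding surprises
zk = ZK.new(ARGV.first || 'localhost:2181')
zk.create('/mirrors/resource-42', safe)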
