RiskLib 0.1 Parsing
Overview:
The OpenQuake engines need to support a variety of input and output formats. At a later stage in the project it will likely make sense to develop a rigorous and documented formal exchange format - at this point, it's most important for us to:
- Not duplicate work
- Get things working end-to-end
- Support as diverse a group of real-world users as possible, as early as possible
With this in mind, I suggest that, rather than undertaking formal development of the data format specification, we simply treat it as an area of common development. However, this means it's truly COMMON: one set of Python modules that everyone collaborates on. I expect to see a tremendous amount of discussion, whether on Skype and IRC or on a mailing list (if folks would like to take the time to develop a well-reasoned rationale for their approach). From a technical standpoint, let's make sure we're using the appropriate underlying Python classes for each type of input file:
- If it's a data file (i.e., if we need to support both input and output of this format), use the Python "codecs" module, and implement IncrementalEncoder and IncrementalDecoder.
- If it's a configuration file, check whether a --flagfile would serve better before reaching for properties/ini config files.
- When you're writing your parsing library, make sure you can round-trip the data (decode a file, then encode it back to a file, and end up with equivalent files).
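A minimal sketch of the first and third points, assuming a hypothetical line-oriented format of `site_id,value` records (the format, the `RecordDecoder` / `RecordEncoder` classes and the `round_trip` helper are illustrative assumptions, not an agreed design). The decoder buffers partial lines so the input can arrive in arbitrary chunks:

```python
import codecs


class RecordDecoder(codecs.IncrementalDecoder):
    """Turn chunks of bytes into (site_id, value) records (hypothetical format)."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self._buffer = b""

    def decode(self, data, final=False):
        self._buffer += data
        lines = self._buffer.split(b"\n")
        # Unless this is the last chunk, the final piece may be an incomplete line.
        self._buffer = b"" if final else lines.pop()
        records = []
        for line in lines:
            if line:
                site_id, value = line.decode("ascii").split(",")
                records.append((site_id, float(value)))
        return records

    def reset(self):
        self._buffer = b""


class RecordEncoder(codecs.IncrementalEncoder):
    """Turn (site_id, value) records back into bytes, one record per line."""

    def encode(self, records, final=False):
        return b"".join(
            f"{site_id},{value!r}\n".encode("ascii") for site_id, value in records
        )


def round_trip(raw, chunk_size=4):
    """Decode raw bytes chunk by chunk and immediately re-encode them."""
    decoder, encoder = RecordDecoder(), RecordEncoder()
    out = []
    for start in range(0, len(raw), chunk_size):
        final = start + chunk_size >= len(raw)
        records = decoder.decode(raw[start:start + chunk_size], final=final)
        out.append(encoder.encode(records, final=final))
    return b"".join(out)


original = b"a,0.25\nb,1.5\n"
assert round_trip(original) == original
```

The deliberately tiny `chunk_size` forces records to straddle chunk boundaries, which is the case a buffered decoder has to get right. Byte-for-byte equality only holds here because the sample values print back exactly; a real format would need its own definition of "equivalent".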
Note also that Python's zlib_codec supports on-the-fly decompression, which is a useful optimization for large binary datasets (decompression is almost always faster than the disk I/O).
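As a rough illustration of on-the-fly decompression, the sketch below feeds a compressed stream through the standard `zlib_codec` incremental decoder one chunk at a time. The `iter_decompressed` helper and the chunk size are assumptions made for the example:

```python
import codecs
import io


def iter_decompressed(stream, chunk_size=64 * 1024):
    """Yield decompressed bytes from a zlib-compressed binary stream, chunk by chunk."""
    decoder = codecs.getincrementaldecoder("zlib_codec")()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Flush whatever the decompressor is still holding.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            break
        data = decoder.decode(chunk)
        if data:
            yield data


payload = b"hazard curve data " * 5000
compressed = codecs.encode(payload, "zlib_codec")
restored = b"".join(iter_decompressed(io.BytesIO(compressed), chunk_size=256))
assert restored == payload
```

In practice `stream` would be a file opened in binary mode, so the whole dataset never has to sit in memory at once.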
Some research and references:
- http://codersbuffet.blogspot.com/2010/03/json-vs-xml-and-python-parsing.html
- http://metaoptimize.com/blog/2009/03/22/fast-deserialization-in-python/
- http://en.wikipedia.org/wiki/YAML#Comparison_to_other_data_structure_format_languages
- http://www.kuwata-lab.com/kwalify/
- http://stackoverflow.com/questions/1061482/why-isnt-this-a-valid-schema-for-rx
- http://en.wikipedia.org/wiki/.properties
- http://docs.python.org/library/configparser.html
- http://www.4feets.com/2009/08/serializing-data-json-vs-protocol-buffers/
- http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=dada27bf-2af0-400d-94c9-5575546f5664
- http://c2.com/cgi/wiki?XmlSucks
- http://c2.com/cgi/wiki?RelationalAlternativeToXml
- http://www.xml.com/pub/a/2001/05/02/champion.html
REQUIREMENTS:
- Fast (quantify) serialization and deserialization
- Buffered deserialization
- Straightforward ETL / simple translations
- Schema and schema validation (nice-to-have)
- Human-readable (nice-to-have)
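One way to start on the "quantify" part of the first requirement is a small timing harness. This is only a sketch: the payload shape, the `time_round_trip` helper and the choice of `json` and `pickle` as the candidates are assumptions, and the numbers it prints will vary by machine and data.

```python
import json
import pickle
import timeit


def time_round_trip(dumps, loads, payload, number=20):
    """Return the best observed time, in seconds, for one dumps + loads cycle."""
    def run():
        loads(dumps(payload))
    return min(timeit.repeat(run, number=number, repeat=3)) / number


# Hypothetical payload: a list of small per-site records.
payload = [{"site": i, "loss": i * 0.5} for i in range(1000)]

results = {
    "json": time_round_trip(json.dumps, json.loads, payload),
    "pickle": time_round_trip(pickle.dumps, pickle.loads, payload),
}

for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1e3:.3f} ms per round trip")
```

Other candidates from the options list could be plugged in by passing their own `dumps` / `loads` pair, provided they accept the same payload.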
OPTIONS:
- XML
- Binary data, using:
  - marshal / cPickle
  - Protobuf
  - Thrift (http://en.wikipedia.org/wiki/Thrift_(protocol) http://incubator.apache.org/thrift/)
  - Properties file (plist binary format)
  - BSON
  - Redis AOF (append-only file)
- ASCII:
  - Properties / ini file
  - CSV (with built-in Python modules)
  - YAML
  - JSON (YAML subset)
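To make the text options easier to compare, here is a small sketch that pushes the same records through three of them using only standard-library modules (`csv`, `json`, `configparser`). The record layout and the helper names are hypothetical, and values are kept as strings so that each format can round-trip them without type conversion:

```python
import configparser
import csv
import io
import json

# Hypothetical records; everything is a string to keep the comparison fair.
records = [{"site": "a", "loss": "0.25"}, {"site": "b", "loss": "1.5"}]


def to_csv(rows):
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["site", "loss"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def from_csv(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def to_json(rows):
    return json.dumps(rows)


def from_json(text):
    return json.loads(text)


def to_ini(rows):
    # One section per site; the section name carries the site id.
    parser = configparser.ConfigParser()
    for row in rows:
        parser[row["site"]] = {"loss": row["loss"]}
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


def from_ini(text):
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return [{"site": name, "loss": parser[name]["loss"]} for name in parser.sections()]


for encode, decode in [(to_csv, from_csv), (to_json, from_json), (to_ini, from_ini)]:
    assert decode(encode(records)) == records
```

Note how the ini version has to invent a mapping from records to sections, which hints that it suits configuration better than tabular data.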