Convert all code to Cython #108

lewismc · 2017-06-13T06:08:50Z

Cython is a superset of the Python programming language, designed to give C-like performance with code that is mostly written in Python. It is well known for delivering significant speed improvements over pure Python code, something we would greatly benefit from when using XSEDE compute resources.
I will begin implementing the Cython enhancements after I've implemented application logging and timeit-based timing. We can then easily compare the runtime of the Cython and Python versions of the pycoal toolkit.
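
A minimal sketch of what such a runtime comparison could look like with the standard-library `timeit` module. The two functions below are hypothetical stand-ins rather than pycoal code; a compiled Cython function could be timed in exactly the same way.

```python
import timeit


def sum_squares_loop(values):
    """Hypothetical baseline: explicit Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values):
    """Hypothetical candidate: stands in for an optimized (e.g. Cython) version."""
    return sum(v * v for v in values)


def best_time(func, data, repeat=3, number=5):
    """Return the best average per-call time in seconds."""
    timer = timeit.Timer(lambda: func(data))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    data = [float(i) for i in range(10_000)]
    for func in (sum_squares_loop, sum_squares_builtin):
        print(f"{func.__name__}: {best_time(func, data):.6f} s per call")
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement.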

ghost · 2017-07-06T11:22:49Z

Converting COAL to Cython might reduce the cost of some of the pixel iteration, but I don't expect it would make a significant difference since I believe most time during mineral classification is spent inside of Spectral Python's spectral angle mapper. Be sure to weigh the cost of rewriting and the cost of using nonstandard Python against the possible time benefits.

See the performance issue for a general discussion on improving time efficiency. Another thing to look into is how much we can shrink the data during processing (from 224 bands down to 128 or 64 or fewer) and still maintain acceptable accuracy. There is probably significant room for improvement by throwing away extraneous data and by finding inefficiencies in the algorithm. Using the subset/threshold approach saves on a lot of the comparisons, but I found that classifying with the full spectral library was probably more accurate.
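
One rough way to explore the band-reduction question is to classify the same pixels with all bands and with a strided subset, then measure how often the two results agree. The sketch below assumes NumPy and uses synthetic data; `classify` and `band_subset_agreement` are hypothetical helpers, not part of pycoal, and keeping every `step`-th band is only the simplest possible reduction.

```python
import numpy as np


def classify(pixels, library):
    """Assign each pixel the index of the library spectrum with the smallest angle.

    pixels: (n_pixels, n_bands); library: (n_classes, n_bands).
    """
    p = pixels / np.linalg.norm(pixels, axis=1, keepdims=True)
    r = library / np.linalg.norm(library, axis=1, keepdims=True)
    # The largest cosine similarity corresponds to the smallest spectral angle.
    return np.argmax(p @ r.T, axis=1)


def band_subset_agreement(pixels, library, step):
    """Fraction of pixels whose class is unchanged when keeping every step-th band."""
    full = classify(pixels, library)
    reduced = classify(pixels[:, ::step], library[:, ::step])
    return float(np.mean(full == reduced))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    library = rng.random((10, 224)) + 0.1  # synthetic spectra, not real minerals
    labels = rng.integers(0, 10, size=500)
    pixels = library[labels] + rng.normal(0, 0.02, (500, 224))
    for step in (1, 2, 4):  # 224, 112 and 56 bands
        print(step, band_subset_agreement(pixels, library, step))
```

Agreement on synthetic data says nothing about real imagery; the same measurement would need to be repeated on actual scenes and the real spectral library.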

Something that probably would benefit from being written in a more efficient language is the spectral angle mapper algorithm itself. That is, it might be worth looking into either contributing improvements to Spectral Python or reimplementing some of the algorithms from scratch. Spectral's implementation is not all that complex and could easily be reimplemented by hand. The linear algebra library methods (dot products, etc.) are probably the biggest time sinks in this code. Interpreted languages like MATLAB are efficient not because they have tight for loops but because they have fast built-in matrix operators.
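
For reference, the core computation is small. Below is a minimal NumPy sketch of the standard spectral angle formula, angle = arccos(x·r / (|x||r|)), assuming an image cube of shape (rows, cols, bands) and a library of shape (classes, bands). It illustrates the formula only and is not Spectral Python's actual implementation; `spectral_angles` is a hypothetical name.

```python
import numpy as np


def spectral_angles(image, library):
    """Return spectral angles in radians, shape (rows, cols, n_classes).

    Pixels or spectra with zero norm produce NaN and should be masked by the caller.
    """
    image = np.asarray(image, dtype=np.float64)
    library = np.asarray(library, dtype=np.float64)
    # Dot product of every pixel with every reference spectrum in one call.
    dots = np.einsum("rcb,kb->rck", image, library)
    norms = (np.linalg.norm(image, axis=2)[:, :, np.newaxis]
             * np.linalg.norm(library, axis=1))
    # Clip to guard against rounding slightly outside [-1, 1].
    cosines = np.clip(dots / norms, -1.0, 1.0)
    return np.arccos(cosines)
```

A class map then follows from `np.argmin(spectral_angles(image, library), axis=2)`. All of the heavy lifting happens inside NumPy's compiled routines, which is why a line-by-line Cython port of the surrounding code may gain little.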

ghost · 2017-07-08T02:40:26Z

Also, just FYI: I don't mean to discourage work on this; I just tend to play devil's advocate when collaborating on things.

ghost · 2017-11-08T07:55:57Z

Assuming that mineral classification is the biggest time sink, the callback we are considering implementing would provide one way to address this issue. The callback could point to an efficient compiled function, which would hopefully speed things up even if the rest of the toolkit is in plain old Python. The NumPy SAM implementation is probably not half bad, however. At this point the performance analysis is largely speculative, so it would probably be best to profile things before attempting to optimize.
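
As a sketch of the callback idea: the classification routine accepts any callable with a fixed signature, so a compiled implementation can be swapped in without touching the rest of the toolkit. `classify_image`, `numpy_sam`, and the callback signature are hypothetical here and do not reflect pycoal's actual API.

```python
import numpy as np


def numpy_sam(pixels, library):
    """Default callback: spectral angles in radians, shape (n_pixels, n_classes)."""
    dots = pixels @ library.T
    norms = np.linalg.norm(pixels, axis=1)[:, None] * np.linalg.norm(library, axis=1)
    return np.arccos(np.clip(dots / norms, -1.0, 1.0))


def classify_image(image, library, angle_func=numpy_sam):
    """Classify each pixel of a (rows, cols, bands) cube.

    angle_func is any callable (pixels, library) -> angles; a function
    compiled with Cython or wrapped from C could be passed here instead.
    """
    rows, cols, bands = image.shape
    angles = angle_func(image.reshape(-1, bands), library)
    return np.argmin(angles, axis=1).reshape(rows, cols)
```

With this shape, comparing a compiled callback against the NumPy default only requires changing one argument.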

ghost · 2018-03-02T11:03:59Z

In an effort to keep the issue tracker up to date, I will close this issue and recommend that the action items above be pursued incrementally in future issues: in particular, logging, profiling, and optimizing. However, if there are any arguments in favor of committing to the Cython language, then this issue can be reopened and pursued by anyone motivated enough to convert all the code, as I am not deeply attached to Python.

The fundamental goal here is to improve efficiency, which brings up a lot of subtleties in the current code and architectural issues in future versions of the library that I think have less to do with the language and more to do with the algorithm. The key question is whether the classifier or the code calling the classifier dominates the running time. This is not totally obvious given things like file I/O, but my intuition is that the classification algorithm would dominate, especially when more intensive machine learning algorithms are utilized.
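
A profile would answer that question directly. Here is a minimal sketch using the standard-library `cProfile` and `pstats` modules; `load_pixels`, `classify`, and `pipeline` are hypothetical stand-ins for the I/O stage, the classifier, and the calling code, not pycoal functions.

```python
import cProfile
import io
import pstats


def load_pixels(n):
    """Hypothetical stand-in for file I/O and preprocessing."""
    return [[float((i * j) % 7 + 1) for j in range(8)] for i in range(n)]


def classify(pixels):
    """Hypothetical stand-in for the classifier."""
    return [max(range(len(p)), key=p.__getitem__) for p in pixels]


def pipeline(n=2000):
    return classify(load_pixels(n))


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_report(pipeline))
```

Comparing the cumulative time of the loading stage with that of the classifier shows which side dominates before any rewrite is attempted.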

What I am most likely to pursue for my own purposes down the road is factoring out the reusable parts of the library from the parts dedicated to coal mining, or otherwise implementing them equivalently. A core library built on top of existing high-performance remote sensing and machine learning libraries (Orfeo ToolBox being one very good candidate) and possibly implemented in a language like C++ would expose a general interface that could then be encapsulated in COAL and other projects while maintaining the user-friendly Python API and compatibility with all of the existing deployment processes.

ghost closed this as completed Mar 2, 2018
