Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert all code to Cython #108

Closed
lewismc opened this issue Jun 13, 2017 · 4 comments
Closed

Convert all code to Cython #108

lewismc opened this issue Jun 13, 2017 · 4 comments

Comments

@lewismc
Copy link
Member

lewismc commented Jun 13, 2017

Cython is a superset of the Python programming language, designed to give C-like performance with code which is mostly written in Python. It is well known to have significant speed improvements over Python code... something we would greatly benefit from when using XSEDE compute resources.
I will begin implementing the Cython enhancements after I've implemented application logging and timeit. We can then easily undertake comparisons in runtime code execution between Cython and Python versions of the pycoal toolkit.

@ghost
Copy link

ghost commented Jul 6, 2017

Converting COAL to Cython might reduce the cost of some of the pixel iteration, but I don't expect it would make a significant difference since I believe most time during mineral classification is spent inside of Spectral Python's spectral angle mapper. Be sure to weigh the cost of rewriting and the cost of using nonstandard Python against the possible time benefits.

See the performance issue for a general discussion on improving time efficiency. Another thing to look into is how much we can shrink the data during processing (from 224 bands down to to 128 or 64 or less) and still maintain acceptable accuracy. There is probably significant room for improvement by throwing away extraneous data and by finding inefficiencies in the algorithm. Using the subset/threshold approach saves on a lot of the comparisons, but I found that classifying with the full spectral library was probably more accurate.

Something that probably would benefit from being written in a more efficient language is the spectral angle mapper algorithm itself. That is, it might be worth looking into either contributing improvements to Spectral Python or reimplementing some of the algorithms from scratch. Spectral's implementation is not all that complex and could easily be reimplemented by hand. The linear algebra library methods (dot products, etc.) are probably the biggest time sinks in this code. Interpreted languages like MATLAB are efficient not because they have tight for loops but because they have fast built-in matrix operators.

@ghost
Copy link

ghost commented Jul 8, 2017

Also just FYI I don't mean to discourage work on this, I just tend to play devil's advocate when collaborating on things.

@ghost
Copy link

ghost commented Nov 8, 2017

Assuming that mineral classification is the biggest time sink, the callback we are considering implementing would provide one way to address this issue. The callback could point to an efficient compiled function which would hopefully speed things up even if the rest of the toolkit is in plain old Python. The NumPy SAM implementation is probably not half bad however. At this point the performance analysis is largely speculative so it would probably be best to profile things before attempting to optimize.

@ghost
Copy link

ghost commented Mar 2, 2018

In an effort to keep the issue tracker up to date, I will close this issue and recommend that the action items above be pursued incrementally in future issues: In particular, logging, profiling, and optimizing. However, if there are any arguments in favor of committing to the Cython language then this issue can be reopened and pursued by anyone motivated enough convert all the code, as I am not deeply attached to Python.

The fundamental goal here is to improve efficiency which brings up a lot of subtleties in the current code and architectural issues in future versions of the library that I think have less to do with the language and more to do with the algorithm. The key question is whether the classifier or the code calling the classifier dominates the running time. This is not totally obvious given things like file I/O, but my intuition is that the classification algorithm would dominate especially when more intensive machine learning algorithms are utilized.

What I am most likely to pursue for my own purposes down the road is factoring out the reusable parts of the library from the parts dedicated to coal mining, or otherwise implement them equivalently. A core library built on top of existing high-performance remote sensing and machine learning libraries (Orfeo ToolBox being one very good candidate) and possibly implemented in a language like C++ would expose a general interface that could then be encapsulated in COAL and other projects while maintaining the user-friendly Python API and compatibility with all of the existing deployment processes.

@ghost ghost closed this as completed Mar 2, 2018
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant