Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiindex necessary? #410

Open
musicnerd opened this issue Aug 25, 2016 · 10 comments
Open

Multiindex necessary? #410

musicnerd opened this issue Aug 25, 2016 · 10 comments

Comments

@musicnerd
Copy link
Collaborator

I was talking with Alex M. today about some annoyances that arise seemingly out of VIS's use of the multiindexed dataframes. Although Alex explained how the multiindex can be used, it seems that nobody is really making use of it. That is, everything most users do, they could do with just simple (non-multiindexed) pandas dataframes.

Getting rid of the extra "level" would make a lot of the indexing and renaming much simpler, no? Also, by removing the multiindex level, newer users to VIS would have an easier time manipulating objects and using the tools since there are a lot of tutorials, etc. for using pandas dataframes (but some of the manipulations as shown in those tutorials don't work within VIS because of the multiindexed dataframe). Thoughts?

@musicnerd
Copy link
Collaborator Author

Well for now, yes all 5 users. Do you use the multiindex? Simpler is always better exactly! So get rid of the multiindex! I'm not suggesting that things can't be done, it's just confusing for new users (something that will presumably be an issue in the future if we ever want other people to use it) and re-naming is a pain. I'm just trying to figure out why it was set up like this in the first place and why it remains necessary (and/or how hard it would be to get rid of it and keep VIS's functionality)

@alexandermorgan
Copy link
Contributor

alexandermorgan commented Aug 26, 2016

Perhaps @crantila could speak to the logic behind the multi-index in its original conception. The utility of labelling data is sort of obvious but there are a couple of new directions that we have been taking with VIS that make the columnar multi-index either increasingly troublesome or unnecessary.

It is more troublesome than before because if you want to filter one dataframe based on the values in another dataframe, you have to first remove the multi-indecies of both dataframes. We have only recently started to do this type of conditional filtering quite a lot. For example, say you have the noterest results of a piece saved to the variable nr, and the durations of those notes and rests saved to the variable dr. Both nr and dr are dataframes, and they are the same shape. But since their multi-indecies don't match (because the first level of nr is 'noterest.NoteRestIndexer' and the first level of dr is 'meter.DurationIndexer') you can't filter the notes for their durations until you remove or index through the first level of both multi-indecies. So, specifically if you were looking for all the instances of 'C4' that last at least a half note (2), you have to run this line of code:

nr['noterest.NoteRestIndexer'][nr['noterest.NoteRestIndexer'] == 'C4'][dr['meter.DurationIndexer'] >= 2]

But, if we didn't have multi-indexed columns, that same line of code would look like this:

nr[nr == 'C4'][dr >= 2]

@mborsodi , you were using conditional filtering quite a bit in your final projects, would you care to weigh in on this issue?

Ok, so that was the ease-of-use consideration. In VIS 3, the multi-index will not be needed as much as it was before because most results will be stored as values in a dictionary of analyses stored as an attribute of an indexed_piece. Previously this caching was only done for the noterest indexer results. Since the key that you use to access a dataframe in this analysis dictionary via indexed_piece.get_data() is the name of the indexer in question, the label is still there, just not in the same place. It's not as if we have a bunch of dataframes just floating around without knowing what they correspond to. In fact, the dictionary keys are more precise because in addition to showing what indexer they were run with, the dataframes can be linked to the piece they were run on since they are associated with a specific indexed piece.

@alexandermorgan
Copy link
Contributor

This is related to #394 .

@crantila
Copy link
Member

In fact, the MultiIndex columns are the simplest solution. What nobody remembers is the problems they solve. Multi-indexed columns allow the results of every Indexer used in an analysis to be stored together, which is better for remotely caching results. Multi-indexed columns allow every single Indexer to have the same data structure for input and output: a single DataFrame. Finally, multi-indexed columns avoid ever having to ask "but where did these data come from?" Just look at the DataFrame and you can see every step along the way.

VIS isn't optimized to be easy to learn or use. If something you found on StackOverflow doesn't work with VIS, too bad, it's your responsibility to understand what you're doing and how to do it properly. If you don't like typing the long, namespaced indexer names, too bad, they aren't there for your sake to begin with.


It's interesting for me to watch how VIS evolves over time. Before it was a Framework, VIS was a set of music21 scripts. As the GUI grew, the scripts evolved into an MVC-like structure where the data manipulation was still poorly-structured music21 scripts (VIS 1 through 9). As we dreamed of the web and analyses beyond counterpoint, we needed not specific to one UI, so we adopted pandas (VIS Framework 1.0). As we gained engineering experience and realized that nobody was writing and running VIS scripts by hand, we moved toward a more robust, highly predictable, but less flexible structure (VIS Framework 2.0).

Times have changed. RODAN learned how to handle things other than OMR, and now the robustness and predictability happens there. Unlikely though it was, more people started writing VIS scripts. Caching on remote servers has failed to materialize, but there's a new caching plan. Even though we've learned how to handle NaN values, now it's just a pain in the ass.

Multi-indexed columns were the simple solution. The situation has changed and those problems are solved in different (significantly more complicated) ways now, so the column names end up looking bizarre and superfluous.

Let's shed a single tear: multi-indexed columns should go.

@alexandermorgan
Copy link
Contributor

@crantila , your post is both informative and hilarious. Bravo!

@ahankinson
Copy link

The situation has changed and those problems are solved in different (significantly more complicated) ways now...

Could you explain this a bit more?

@crantila
Copy link
Member

@ahankinson stop teasing, surely people have other things to do than write extremely verbose github comments, like writing documentation... 😛

That's a strange way to say "please help us by working on the API!"

@ahankinson VIS Framework was designed to run on a server as the primary thing in charge. That was the biggest influence on architectural decisions like the multi-indexed columns. Since I left, there's been a trend to decentralize complexity: server-side data flow is handled by RODAN; data visualization by VIS-Ualizer; nobody even remembered WorkflowManager or AggregatedPieces; for some reason there are extensions for both Max/MSP and PureData; and incomprehensibly the melodic search is being implemented in an extension instead of as Indexers.

For the most part this means VIS itself can be simplified, but not always. Some of these changes are obviously better design decisions, some of them seem motivated by other concerns, but each one of them contributes to significantly more complicated overall solutions.

@crantila
Copy link
Member

Cleverly edited response. Snarky comments in produce snarky comments out. I'm sorry for criticizing decisions I don't understand. The point remains: the benefits you're seeing come at the cost of higher overall system complexity.

If you're looking for substantial effort, I am available for contract work. Otherwise I would be happy to help with specific tasks.

@alexandermorgan
Copy link
Contributor

@crantila thanks for your comments. I think you make a good point about how the current solution of keeping the indexer and class information in the columnar multi-index is the simplest solution as far as the program architecture is concerned. So the ease of use we're looking for with the removal of this multi-index does come at the cost of somewhat more complexity in the backend. I hadn't thought of it that way but I agree with it. More generally, @crantila your feedback and/or critiques on this and any other topics are always welcome and appreciated.
With respect to the dictionary of analyses I mentioned in my comments above, it already appears on alex_devel but I won't push it to develop until I'm done fixing tests, adding new ones, and documenting the changes. Only once those changes are resolutely squared away would it be appropriate (in my opinion) to remove the first level of the columnar multi-index.

@ahankinson
Copy link

Thanks, Christopher. I was genuinely interested in your response to my question, and not trying to provoke anything. I think the intention for the melodic search is that it will be rolled into an indexer, but it's just sitting outside VIS while we figure it out.

@alexandermorgan alexandermorgan changed the title Multiindexer necessary? Multiindex necessary? Sep 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants