Repository for Use Case 3, Machine Learning from Morgan Stanley's PADT Services team.
As part of the hackathon, we explored answers to the following questions:
Can we find commonalities among cases to create segments and find benchmarks based on looking at the data alone?
and
Observed that data such as gender information is irrelevant to the actual percentage of successful trials.
The future goal is to use a built-in feature selector in Python to improve clustering results and make use of more features. The initial focus was to get a basic model set up to answer the question.
Other key feature selection tasks:
- aggregating duration/time period into a single numeric value
- replacing NaN values with the mean for continuous variables/features like lag
- replacing NaN values with 0 or 1 (discrete values) for discrete variables
- aggregations along trialIdx, sessionIdx, to simplify initial analysis.
- one-hot encoding for classification if not already present
- K-Means clustering with clusteval to find the best clustering by silhouette score
- PCA to get a 2D picture of the clusters through dimensionality reduction
- Implement CART algorithm analysis to better estimate feature importance
- Work on interpretability and evaluation of clustering
- Try to answer the question on groupings based on goal/skill domain by framing it as a supervised learning problem and utilizing random forest/decision trees.
- Explore density based clustering methods to find other patterns in the data
- Continue working on this project outside of the hackathon
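
The preprocessing and clustering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the hackathon code: the data is synthetic, the column names (`lag`, `duration_sec`, `prompted`, `skill_domain`) are hypothetical, and scikit-learn's `silhouette_score` is used directly in place of clusteval to pick the number of clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-session aggregates; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 150
group = np.repeat([0, 1, 2], n // 3)
df = pd.DataFrame({
    "lag": rng.normal(loc=group * 5.0, scale=0.5),
    "duration_sec": rng.normal(loc=group * 30.0 + 60.0, scale=3.0),
    "prompted": rng.integers(0, 2, size=n).astype(float),
    "skill_domain": rng.choice(["motor", "language"], size=n),
})
# Inject some missing values so the imputation steps have something to do.
df.loc[df.index[::10], "lag"] = np.nan
df.loc[df.index[::15], "prompted"] = np.nan


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values and one-hot encode the categorical column."""
    out = frame.copy()
    # Continuous feature: replace NaN with the column mean.
    out["lag"] = out["lag"].fillna(out["lag"].mean())
    # Discrete 0/1 feature: replace NaN with a fixed value (0 here, by assumption).
    out["prompted"] = out["prompted"].fillna(0)
    # One-hot encode the categorical feature.
    return pd.get_dummies(out, columns=["skill_domain"], dtype=float)


def best_kmeans(X, k_values=range(2, 7), seed=0):
    """Fit K-Means for each k and keep the fit with the highest silhouette score."""
    best = None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


features = preprocess(df)
X = StandardScaler().fit_transform(features)  # put features on a comparable scale
best_k, best_score, labels = best_kmeans(X)
coords = PCA(n_components=2).fit_transform(X)  # 2D projection for plotting

print(f"chosen k={best_k}, silhouette={best_score:.3f}")
```

The `coords` array can be passed to a scatter plot colored by `labels` to get the 2D picture of the clusters. Scaling before K-Means is a choice made for this sketch so that no single feature dominates the distance computation.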
Thanks to the fantastic organizers and tech leads in the machine learning group for answering all of the questions.