Predicting Autism Diagnosis Effectiveness Using Various Machine Learning Models
Contributor: Souvik Mazumdar (Nasdaq)
This use case analyzes anonymized therapy data from a repository of Applied Behavior Analysis (ABA) sessions and cases to draw insights that families, doctors, and researchers can use to understand the effectiveness of their programs.
- Goals
- Project Analysis
- Tools
- Setup
- Exploratory Data Analysis
- Algorithm Analysis
- Conclusion
- Future Scope
- References
- Is 80% as a set goal indicative of future successes?
  - a. Are there data subsets that may use another target?
  - b. How would lagged / lead successes indicate future successes?
  - c. Would subsetting the target(s) in the data, such as by author, goal name, and/or goal domain, offer better target variable(s)?
- Using the goal set at 80%, or a variation found above, what are the descriptors of those who un/successfully complete goal outcomes?
  - a. Does age and/or sex affect un/successful outcomes?
- What are the descriptors of the treatments for successful / unsuccessful goal outcomes?
  - a. Does treatment intensity (defined as the number of trials per day per week) impact outcome progress? [`sessionCount_byGoal_byDayYear`, `sessionCount_byGoal_byWeekYear`]
  - b. Is there an optimum frequency / cadence of therapies?
  - c. What are the returns of intensity over time (i.e. steeper earlier and flatter later in the curve)?
- Are there variables around the therapist that predict better outcomes?
  - a. Do therapist changes in a goal affect outcomes? [collectively use the `trialStarts_byClient*` variable group]
  - b. Does the number of therapists intervening towards a goal affect the outcomes? [`authorChanges_byTrialId`, `authorChanges_byTrial_Outcome`]
  - c. Could the therapist affect positive outcomes by goal? By phase mode? [`trialStarts_byAuthor_Outcome_Goal`, `trialStarts_byAuthor_Outcome_Phase`]
  - d. Does the time it takes a therapist to chart sessions after data entry correlate with outcomes? [`period_dataLogged_dataGraphed`]
  - e. Does the time in a session outside of charting data correlate with outcomes? [`period_graphedBySession`]
  - f. Is there significance between outcomes and higher numbers of goals? Higher numbers of phase modes? Different types of phase modes?
Standardized Data Preparation and Jupyter Notebooks
The data are processed in a standardized way using a Python script that prepares them for the machine learning classifiers. We then implement different models and accuracy-validation techniques, as represented below.
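A minimal sketch of such a standardized preparation step, assuming a hypothetical `sessions.csv` export with a binary `success` label (the file and column names are illustrative, not the repository's actual schema):

```python
# Sketch of a standardized preparation step (hypothetical schema).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sessions.csv")      # hypothetical prepared export
df = df.dropna()                      # drop incomplete session records

X = df.drop(columns=["success"])      # predictor variables
y = df["success"]                     # target: goal outcome met (1) or not (0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler().fit(X_train)   # fit scaling on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```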
- Jupyter Notebook: write code in a virtual environment
- Python libraries: numpy, pandas, seaborn, plotly, sklearn (dimensionality reduction, train-test split, grid search, cross-validation, performance evaluation); a short sketch of how these fit together follows this list
- Git: track file changes
- GitHub: organize team workflow and project
- Machine learning: Azure ML Studio
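To illustrate how the sklearn pieces above fit together, here is a hedged sketch of a cross-validated grid search over a random forest, reusing the hypothetical `X_train`/`y_train` from the preparation sketch:

```python
# Sketch: hyperparameter grid search with 5-fold cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],       # forest sizes to try
    "max_depth": [None, 10, 20],      # tree depths to try
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                             # 5-fold cross-validation
    scoring="f1",                     # optimize the F1 score reported below
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```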
- Clone the repo: `git clone https://github.com/fsi-hack4autism/ML_Autism_UC3_Souvik.git`
- Open the file `autism_hack.ipynb` in any Python notebook environment and you're ready to go.
Glimpse of the EDA done to understand the data
This shows the current goal status of the clients.
Session Count Intensity Curve
From the curve we can see that the success rate rises with the number of sessions, so more sessions should be organized per day/month/year. The highest concentration of successes occurs in the initial sessions, so keeping regular sessions in the initial days should yield a higher success rate.
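A hedged sketch of how such an intensity curve can be computed, using the repository's `sessionCount_byGoal_byWeekYear` variable and the hypothetical `success` label from earlier:

```python
# Sketch: success rate as a function of weekly session count.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sessions.csv")      # hypothetical prepared export
rate = (
    df.groupby("sessionCount_byGoal_byWeekYear")["success"]
      .mean()                         # fraction of successful outcomes per count
      .reset_index(name="success_rate")
)
sns.lineplot(data=rate, x="sessionCount_byGoal_byWeekYear", y="success_rate")
plt.xlabel("Sessions per goal per week")
plt.ylabel("Success rate")
plt.show()
```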
Correlation Matrix for all the feature variables
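A minimal sketch of drawing such a heatmap with seaborn (the `sessions.csv` file name is again a placeholder):

```python
# Sketch: correlation heatmap over the numeric features.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sessions.csv")              # hypothetical prepared export
corr = df.select_dtypes("number").corr()      # pairwise Pearson correlations
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()
```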
Line Plot to show Test Score vs Train Score
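One way to produce such a plot is sklearn's `validation_curve`; this sketch sweeps the forest size and assumes the `X_train`/`y_train` arrays from the earlier sketches:

```python
# Sketch: train vs. cross-validated test score as the forest grows.
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

sizes = [50, 100, 200, 400]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X_train, y_train,
    param_name="n_estimators", param_range=sizes, cv=5,
)
plt.plot(sizes, train_scores.mean(axis=1), label="Train score")
plt.plot(sizes, test_scores.mean(axis=1), label="Test (CV) score")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```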
Confusion Matrix
| F1 Score | Precision Score | Recall Score |
| --- | --- | --- |
| 0.9918818722131946 | 0.9931865209378359 | 0.9906003851387001 |
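These numbers can be reproduced with `sklearn.metrics`; a minimal sketch, assuming the `search`, `X_test`, and `y_test` objects from the earlier sketches:

```python
# Sketch: confusion matrix and the scores reported above.
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score,
)

model = search.best_estimator_        # fitted classifier from the grid search
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
```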
Accuracy Metrics using the Random Forest Algorithm
ROC Curve to validate the model, which gives a 97.6% accuracy
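A hedged sketch of plotting the ROC curve and its area under the curve, reusing the assumed `model`, `X_test`, and `y_test`:

```python
# Sketch: ROC curve and AUC for the fitted classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

probs = model.predict_proba(X_test)[:, 1]   # positive-class probabilities
fpr, tpr, _ = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")    # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```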
What do these scores mean?
We get the highest accuracy with Random Forest, with the score reaching 99%. This means our model classified observations correctly 99% of the time. Precision stood at 0.99, meaning 99% of the observations the model labelled as successes were actual successes, and Recall also stood at 0.99, meaning the model captured 99% of the actual successes. The F1 score, the harmonic mean of precision and recall, was likewise 0.99; it assigns equal weight to both metrics. For our analysis, however, it is relatively more important for the model to have few false negatives (cases where the client is in fact improving with the sessions but the model predicts otherwise).
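As a quick sanity check, the F1 score in the table above is indeed the harmonic mean of the reported precision and recall:

```python
# Verify the reported F1 score as the harmonic mean of precision and recall.
precision = 0.9931865209378359
recall = 0.9906003851387001
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.99188, matching the reported F1 score
```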
Among our predictor variables, which features have the most importance?
We therefore select the Random Forest classifier as the right model, due to its high accuracy, precision, and recall scores. One reason Random Forest showed improved performance is the presence of outliers in the data: Random Forest is not a distance-based algorithm, so it is not much affected by outliers, whereas distance-based algorithms such as Logistic Regression and Support Vector Machines showed lower performance. Based on the feature importance, author changes are the most important factor in determining a successful trial, with the other factors also contributing to the prediction. This result makes sense, since a change of therapist is one of the first things that actually affects the outcomes.
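A minimal sketch of extracting the impurity-based importances behind such a ranking from the fitted forest (reusing the assumed `model` and the feature frame `X`):

```python
# Sketch: rank features by the forest's impurity-based importances.
import pandas as pd
import matplotlib.pyplot as plt

importances = (
    pd.Series(model.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances.head(10))       # per the analysis, authorChanges_* ranks highest
importances.head(10).plot.barh()  # quick feature-importance plot
plt.tight_layout()
plt.show()
```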
- Analyzing the goals by clustering them and assigning the most suitable goal to each individual (a possible starting sketch follows this list).
- Considering more variables for evaluating success.
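As referenced in the first item above, a purely illustrative starting point for the goal-clustering idea, with hypothetical per-goal aggregates and column names:

```python
# Sketch: k-means clustering of goals on hypothetical per-goal aggregates.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

goals = pd.read_csv("goals.csv")  # hypothetical per-goal summary table
features = StandardScaler().fit_transform(
    goals[["sessions_per_week", "success_rate"]]  # illustrative columns
)
goals["cluster"] = KMeans(n_clusters=4, random_state=42).fit_predict(features)
print(goals.groupby("cluster").mean(numeric_only=True))
```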
Creating README files: https://bulldogjob.com/news/449-how-to-write-a-good-readme-for-your-github-project
Implementing algorithms: https://www.analyseup.com/learn-python-for-data-science/python-random-forest-feature-importance-plot.html
Autism study: https://www.kaggle.com/faizunnabi/autism-screening-classification
Special Thanks to all the Event Organizers of the Hackathon https://fsi-hack4autism.github.io/#about