IBM Data Science - Coursera

What is Data Science

Data science is an interdisciplinary field that studies and analyzes structured and unstructured data. It combines programming, mathematics, statistics, and probability (among other disciplines) to manipulate, explore, and extract knowledge from that data.

A Data Scientist is a curious person who is passionate about continuous learning. He/she needs knowledge of mathematics, statistics, probability, and programming, which helps him/her manipulate and extract information from structured and unstructured data. A Data Scientist also knows how to tell a story from the results he/she has obtained.

More about Storytelling here

Murtaza Haider, Professor at Ryerson University:

I think what you need to do is to see, given the pool of applicants you have, who has the most resonance with your firm's DNA. Because you can teach analytics skills, anyone can learn analytics skills if they dedicate time and effort to it. But what really matters is who's passionate about the kind of business that you do.

And I would say if I'm looking for someone, if I have to put together a data science team, I would first look for curiosity. Is that person curious about things not just for data science but anything like, are they curious about why this room is painted a certain way, why do the bookshelves have books, and what kinds of books? They have to have a certain degree of curiosity about everything that is in their vision, that they look at. The second thing is do they have a sense of humor because, you see, you have to be lighthearted about it. If someone is too serious about it, they probably would take it too seriously, and would not be able to look at the lighter elements. The third thing I think, and I think the last thing that I would look for if I had to have a hierarchy, the last thing I would look for are technical skills. I would go through the social skills, curiosity, and sense of humor. The ability to tell a story. The ability to know that there is a story there. And then once all is there then I would say, well, can you do the technical side of it? And if there is some hope or some sign of some technical skills, I would take them because I can train them in whatever skills they need. But I cannot teach curiosity. I cannot teach storytelling. I certainly cannot instill a sense of humor in anyone.

😊 👏

According to the course material, a final deliverable in the form of a report has the following 10 main components:

  1. Cover page
  2. Table of contents
  3. Introductory section
  4. Methodology section
  5. Results section
  6. Discussion section
  7. Conclusion section
  8. References
  9. Acknowledgment
  10. Appendix

Another interesting version of the 10 components of a Data Science project here

Python Project for Data Science

Tools for Data Science

The languages of Data Science: Python, R, and SQL are the recommended languages, but many others have their own strengths and features. Scala, Java, C++, and Julia are some of the most popular. JavaScript, PHP, Go, Ruby, and Visual Basic all have their own unique use cases as well.

Roles available for people who are interested in getting involved in data science:

  • Business Analyst
  • Database Engineer
  • Data Analyst
  • Data Engineer
  • Data Scientist
  • Research Scientist
  • Software Engineer
  • Statistician
  • Product Manager
  • Project Manager
  • Etc!

Most used languages for data science

  1. [Python](https://www.python.org/) is useful for many situations, including data science, AI and machine learning, web development, and IoT devices like the Raspberry Pi.
  2. [R](https://www.r-project.org/) is popular in academia, but it is also used by companies including IBM, Google, Facebook, Microsoft, Bank of America, Ford, TechCrunch, Uber, and Trulia. R has become the world’s largest repository of statistical knowledge.
  3. SQL was designed for managing data in relational databases.
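
As a minimal sketch of how SQL manages relational data, the snippet below runs a few statements through Python's built-in `sqlite3` module. The `cars` table, its columns, and its rows are invented for illustration and do not come from the course material.

```python
import sqlite3

# In-memory database; the table and its rows are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (brand TEXT, horsepower INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [("audi", 102, 13950.0), ("bmw", 101, 16430.0), ("audi", 115, 17450.0)],
)

# Aggregate with SQL: average price per brand, sorted by brand name.
rows = conn.execute(
    "SELECT brand, AVG(price) FROM cars GROUP BY brand ORDER BY brand"
).fetchall()
conn.close()

print(rows)
```

The query returns one `(brand, average_price)` tuple per brand, which shows the declarative style: the code states what result is wanted and the database decides how to compute it.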

Data Science Methodology

  1. Data Management
  2. Data Integration and Transformation
  3. Data Visualisation
  4. Model Building
  5. Model Deployment
  6. Model Monitoring and Assessment

Development environments

  • One of the most popular current development environments that data scientists are using is “Jupyter.” Jupyter first emerged as a tool for interactive Python programming; it now supports more than a hundred different programming languages through “kernels.”

  • RStudio is one of the oldest development environments for statistics and data science, having been introduced in 2011.

API

  • An API (application programming interface) lets two pieces of software talk to each other. It is a set of standards and protocols that enables cross-platform communication.
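
One way to picture this is a client and a service that agree only on the shape of the messages they exchange. The sketch below is a toy, in-process illustration rather than a real web API: `handle_request` and its JSON fields (`values`, `count`, `mean`) are hypothetical names chosen for the example.

```python
import json


def handle_request(request_json: str) -> str:
    """Hypothetical service side: accept a JSON request, return a JSON response."""
    request = json.loads(request_json)
    values = request["values"]
    reply = {"count": len(values), "mean": sum(values) / len(values)}
    return json.dumps(reply)


# Client side: it only needs to know the agreed message format,
# not how the service computes its answer.
request_json = json.dumps({"values": [2, 4, 6]})
response = json.loads(handle_request(request_json))

print(response)
```

Because both sides depend only on the agreed JSON format, either one could be rewritten in another language without changing the other.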

Data sets

  • A data set is a structured collection of data. Data embodies information that might be represented as text, numbers, or media such as images, audio, or video files. A tabular data set comprises a collection of rows, which in turn comprise columns that store the information. A data set can be public or private.
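
To make the row and column structure concrete, here is a small sketch that parses a tabular data set with Python's standard `csv` module. The CSV text is made up for illustration; a real data set would usually be read from a file.

```python
import csv
import io

# Made-up tabular data: the first line names the columns,
# and each later line is one row.
raw = """brand,color,horsepower
audi,red,102
bmw,blue,101
"""

# Each row becomes a dict that maps column name to value.
rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# CSV values are read as text, so numeric columns need converting.
horsepower = [int(row["horsepower"]) for row in rows]

print(columns)
print(horsepower)
```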

Machine Learning Models

  • Machine learning uses algorithms (also known as “models”) to identify patterns in the data. The process by which the model learns these patterns from data is called “model training.” Once a model is trained, it can be used to make predictions: when presented with new data, it tries to make predictions or decisions based on the patterns it has learned from past data. Machine learning models can be divided into three basic classes: supervised learning, unsupervised learning, and reinforcement learning.
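
The train-then-predict cycle can be sketched with a very small supervised model. The example below fits a straight line by ordinary least squares in plain Python, standing in for a library model; the function names `train` and `predict` and the data points are assumptions made for illustration.

```python
def train(xs, ys):
    """Model training: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Prediction: apply the learned line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Labeled past data (supervised learning); these points follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = train(xs, ys)
prediction = predict(model, 5.0)  # an input the model has not seen before

print(model, prediction)
```

Here training recovers the pattern in the labeled examples, and prediction applies that pattern to new data, which is the same split the paragraph above describes.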

Open Source Tools for Data Science

Data Analysis with Python

The module discusses the importance of data analysis in obtaining useful information, answering questions, and predicting the unknown future. Using an example of a friend who wants to sell his car, the module explains how data analysis can help determine the best price considering characteristics such as color, brand, and horsepower. The module then introduces data preprocessing techniques, such as handling missing values, data normalization, and conversion of categorical variables to numeric. Exploratory data analysis is also covered, including descriptive statistics, data grouping, and correlation analysis. Additionally, the concept of simple and multiple linear regression is introduced, along with how to fit these models in Python using the scikit-learn library. Finally, the module discusses the importance of cross-validation in machine learning model evaluation and presents useful functions for calculating cross-validation scores and predictions.
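
The preprocessing steps named above can be sketched in plain Python. This is only an illustration under stated assumptions: the car records are invented, mean imputation and min-max scaling are one common choice among several, and the course itself may use library functions for the same steps.

```python
# Hypothetical car records; None marks a missing value.
cars = [
    {"brand": "audi", "horsepower": 100},
    {"brand": "bmw", "horsepower": None},
    {"brand": "audi", "horsepower": 200},
]

# 1. Missing values: replace None with the mean of the known values.
known = [car["horsepower"] for car in cars if car["horsepower"] is not None]
mean_hp = sum(known) / len(known)
for car in cars:
    if car["horsepower"] is None:
        car["horsepower"] = mean_hp

# 2. Normalization: min-max scale horsepower into the range [0, 1].
low = min(car["horsepower"] for car in cars)
high = max(car["horsepower"] for car in cars)
for car in cars:
    car["horsepower_scaled"] = (car["horsepower"] - low) / (high - low)

# 3. Categorical to numeric: one indicator (one-hot) column per brand.
brands = sorted({car["brand"] for car in cars})
for car in cars:
    for brand in brands:
        car[f"brand_{brand}"] = 1 if car["brand"] == brand else 0

for car in cars:
    print(car)
```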

To be continued...