The goal of this project is to get a sense of how you would approach building a data warehouse.
You have a team of data analysts who want to help build the next big movie hit by analyzing IMDB data. They want to answer questions like the ones below (you do NOT need to answer these questions in your deliverables):
- Which movie had the highest rating per country and year?
- What are the average ages of the actors for each movie?
- Design a data model consisting of FACT and DIMENSION tables that can be used to answer the above questions, as well as offer flexibility for further exploration.
- Implement a program/pipeline that transforms the input data into a form usable by the data model.
- Data model (visual diagram)
- Implementation (use whatever tools and tech stack you like to move the data from the files into the warehouse)
- Supporting documentation
- Download the IMDB data as a zip file from this repo
- This Kaggle data set is the original source
- Feel free to email questions to your recruiting contact; however, we do not want you to wait on replies before moving forward. For most things, simply document your assumptions and move on.
- Use Git/GitHub if possible:
- Clone the repo to your own account
- Store your files and solution in your cloned repo
- Provide a link to us when it's complete (no need to submit a Pull Request as we'd like to protect your confidentiality during the hiring process).
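
To illustrate the FACT/DIMENSION requirement and the pipeline requirement above, here is a minimal sketch using Python's built-in `sqlite3` module. Every table and column name (`dim_movie`, `dim_person`, `fact_rating`, `fact_cast`, and so on) is hypothetical and is not taken from the IMDB files. The sketch also assumes one country per movie and approximates an actor's age as release year minus birth year. Treat it as one possible starting point, not as the expected solution.

```python
import sqlite3

# Hypothetical star schema: two dimension tables and two fact tables.
# Names and columns are illustrative assumptions, not the real IMDB layout.
SCHEMA = """
CREATE TABLE dim_movie (
    movie_key INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    year      INTEGER,
    country   TEXT
);
CREATE TABLE dim_person (
    person_key INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    birth_year INTEGER
);
CREATE TABLE fact_rating (
    movie_key  INTEGER NOT NULL REFERENCES dim_movie(movie_key),
    avg_rating REAL NOT NULL,
    votes      INTEGER
);
CREATE TABLE fact_cast (
    movie_key  INTEGER NOT NULL REFERENCES dim_movie(movie_key),
    person_key INTEGER NOT NULL REFERENCES dim_person(person_key),
    role       TEXT
);
"""


def build_warehouse(conn, movies, people, ratings, cast):
    """Create the schema and load already-cleaned rows (lists of tuples)."""
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_movie VALUES (?, ?, ?, ?)", movies)
    conn.executemany("INSERT INTO dim_person VALUES (?, ?, ?)", people)
    conn.executemany("INSERT INTO fact_rating VALUES (?, ?, ?)", ratings)
    conn.executemany("INSERT INTO fact_cast VALUES (?, ?, ?)", cast)
    conn.commit()


def top_rated_per_country_year(conn):
    """Return (country, year, title, rating) for the best-rated movie(s)."""
    return conn.execute("""
        SELECT m.country, m.year, m.title, r.avg_rating
        FROM fact_rating r
        JOIN dim_movie m ON m.movie_key = r.movie_key
        WHERE r.avg_rating = (
            SELECT MAX(r2.avg_rating)
            FROM fact_rating r2
            JOIN dim_movie m2 ON m2.movie_key = r2.movie_key
            WHERE m2.country = m.country AND m2.year = m.year
        )
        ORDER BY m.country, m.year, m.title
    """).fetchall()


def average_actor_age(conn):
    """Return (title, average age), with age assumed = release year - birth year."""
    return conn.execute("""
        SELECT m.title, AVG(m.year - p.birth_year)
        FROM fact_cast c
        JOIN dim_movie m ON m.movie_key = c.movie_key
        JOIN dim_person p ON p.person_key = c.person_key
        WHERE c.role = 'actor' AND p.birth_year IS NOT NULL
        GROUP BY m.movie_key, m.title
        ORDER BY m.title
    """).fetchall()
```

A real pipeline would first parse and clean the source files before calling something like `build_warehouse`. A production model might also split country and date into their own dimensions, since a movie can list several countries.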