Final Project: Big Data Science

Project Overview

In the final project, we hope each of you will think of yourself as a real-world data scientist. Your goal is to come up with some interesting questions, find the right datasets, and implement a data-processing pipeline to answer those questions. To achieve this, please follow these steps:

  1. Form a data-science team of 2-3 people. Note that single-person groups are not allowed, since it is important for data scientists to know how to collaborate with others.
  2. Brainstorm some project ideas and derive an initial plan
  3. Give a 5-min talk on your project proposal
  4. Present your project in the poster session
  5. Submit your code and report

Todo List

The following table summarizes the TODO list of the final project.

ID | What | When | Where
1 | Initial Plan | Sunday 03/05 at 11:59 PM | Submit the filled form to the CourSys activity Initial Plan
2 | Proposal Presentation | Monday 03/13 at 08:00 AM | Submit your slides to the CourSys activity Presentation
  |  | Monday 03/13 at 09:30 AM | Give a talk in TASC 1 9204 West
3 | Poster Session | Monday 04/10 at 08:00 AM | Submit your poster to the CourSys activity Poster Session
  |  | Monday 04/10 at 10:00 AM | Present your poster at the SFU Big Data Hub
4 | Code & Report | Sunday 04/16 at 11:59 PM | Submit to the CourSys activities Code and Report

Step-by-step Instruction

1. Initial Plan (5 points)

The first thing you need to do is make a plan. Find the right person(s) to work with and come up with a good project topic. Here are some requirements (or hints) about the topic:

  • You have to tell me what questions you want to answer. Typically, there are two ways to arrive at interesting questions. One is to first brainstorm the questions that interest you and then find the right datasets to answer them; the other is to first find datasets that interest you and then discover interesting questions by exploring them. Some common types of questions are:

Submission

  • Create your data-science team in CourSys
  • Download the Initial Plan form template
  • Submit the filled form to the CourSys activity Initial Plan
  • The deadline is Sunday 03/05 at 11:59 PM.

2. Proposal Presentation (30 points)

The Initial Plan that you made above would be meaningless if you couldn't persuade your manager to let you carry it out. Thus, at work, it is very important to know how to give a persuasive speech. To help you practice this skill, each group needs to give a 5-minute talk on the proposed project. Imagine your manager is sitting in the audience. Your goal is to convince him/her that:

  • Your project is super cool
  • You can get it done on time

You can easily find a lot of good tips on how to give a persuasive speech. Take a look at them and try to apply them to your speech.

Requirements

  • EVERYONE should be prepared to give the speech. In class, we will randomly pick a student from each team and ask him/her to give the talk.

  • The length of the talk should be 5 mins! Pay attention to the time. If your talk takes x mins, |x-5| points (rounded) will be deducted from your team grade.

  • You are going to give the presentation at 09:30 AM on Monday 03/13. Please upload your slides to the CourSys activity before 08:00 AM.
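
As a rough illustration of the timing penalty described above, the following sketch computes the deduction for a given talk length. The function name talk_penalty is hypothetical, and rounding to the nearest whole point (with halves rounded up) is an assumption; the requirement only says "(rounding)", so the actual grading rule may differ.

```python
import math

TARGET_MINUTES = 5  # required talk length stated in the requirements


def talk_penalty(minutes_used: float) -> int:
    """Return the points deducted for a talk lasting `minutes_used` minutes.

    Assumption: the deduction |x - 5| is rounded to the nearest whole
    point, with halves rounded up. The handout does not specify this.
    """
    deviation = abs(minutes_used - TARGET_MINUTES)
    return math.floor(deviation + 0.5)


if __name__ == "__main__":
    for minutes in (5, 6.4, 3.5, 7):
        print(f"{minutes} min -> {talk_penalty(minutes)} point(s) deducted")
```

Under this assumed rounding rule, running both over and under 5 minutes costs points, so it is worth rehearsing with a timer.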

3. Poster Session (45 points)

This is show time! Make a poster to present your data product. Here are a few things that you can put on the poster:

  • What questions are you trying to answer?

  • What's your methodology to get the answers?

  • What datasets/tools do you use?

  • What's your data processing pipeline like?

  • What's your data product?

  • What have you learnt through the project?

There should be tables available if you would like to do a demo.

Submission

  • The poster session is scheduled for 10:00 AM on Monday 04/10. Please upload your poster to the CourSys activity Poster before 08:00 AM.

4. Code & Report (20 points)

Source Control

As in CMPT 732, you must use a Git repository for your project. The department's GitLab server is a good way to get one (instructions at that link). Group members must commit their own contributions to the repo. Please give the instructors and TAs (jnwang, aguha, sjishan, zcong) developer access to your repository. You are encouraged to publicize and open-source your work on GitHub or a similar service.

Code Submission

The final implementation is due Sunday 04/16 at 11:59 PM. You will submit a tag from your repository (git tag final; git push --tags) to the CourSys activity Code. In your repository, please include a file README.txt (or README.md if you prefer) explaining how we can test your project, along with any other notes about things we should look for. If you created some kind of web frontend, please include a URL in the README as well.

Report Submission

You will submit a report of at most 5 pages giving an overview of your project.

  • Motivation and Background: Who cares about this project? Any related work?
  • Problem Statement: What questions do you want to answer? Why are they challenging?
  • Data Processing Pipeline: What's your data-processing pipeline like? Describe each component in detail.
  • Methodology: What tools or analysis methods did you use? Why did you choose them? How did you apply them to tackle each problem?
  • Data Product: What's your data product? Please demonstrate how it works.
  • Lessons Learnt: What did you learn from this project?
  • Summary: A high-level summary of your project. It should be self-contained and cover all the important aspects of your project.

The report is also due Sunday 04/16 at 11:59 PM, submitted to the CourSys activity Report as a PDF.