CS 417: Data Mining 2019

Spring 2019

Welcome to Data Mining!

All the homework/labs/lectures will be distributed on Moodle.

We meet MWF, 10-11:50 am, AEC 500.

Course Description

This course covers the challenges and techniques of utilizing big data.  Through the course, you will learn frameworks such as MapReduce for parallel processing of massive datasets.  You will study algorithms in data mining, and their applications including search engines, online advertising, and recommendation systems.  You will explore the social and ethical implications of using these algorithms.

Data mining algorithms can be applied to large datasets to unearth new knowledge.  This knowledge might predict which web page will answer a user’s question, which products should be marketed together in a store, how to use online stores to generate demand for unknown products, or the most satisfying tour for a museum visitor.  Data mining algorithms are foundational knowledge for today’s data science professions.  This course will help you identify the right algorithm for a particular application.

Using massive datasets, collected from perhaps unknowing users, raises ethical issues that have become a center of attention recently due to a series of inappropriate uses.  During the course, we will examine ethical issues in data mining in the context of recent incidents and the evolving ethical standards in the field.

Instructor

  • Instructor: Dr. Joann J. Ordille
  • Email: ordillej@lafayette.edu
  • Office: 522 AEC
  • Office Phone: (610) 330-5416
  • Office Hours: MW: 4-5, T: 1-2, F: 2-3. Please email me if you would like to meet at another time. Also, if my door is open, feel free to drop in.

Resources

Gradiance is our online learning system. Gradiance provides exercises with guidance when you choose an incorrect answer. Create your account using the code I emailed you.

CATME is the class tool for creating teams and evaluating the contribution of team members. This is a general information site. You will receive email for team creation and evaluation.

In the group project, you will form a team and complete the Google Ad Grants Online Marketing Challenge.

  • AWS Educate

I will send you information about joining AWS Educate under Lafayette’s Institutional Membership, which will provide you with free resources for some course projects and for independent study.  We will primarily use AWS Educate to explore the Apache Hadoop Ecosystem, in particular MapReduce.

Apache Zeppelin is installed in multi-user mode on our local server for this course.  You can also download it and install it on your laptop.  It provides a data science workbench that you may find useful in your independent project.  Zeppelin provides Spark as an alternative to MapReduce.

  • Moodle

This course uses Moodle extensively. In particular, detailed information for class activities and assignments will be available weekly. Please check Moodle frequently, and whenever you receive an email notification to do so.

  • Papers published in the computer science literature as announced in this syllabus or on Moodle.

We will read two or more papers together to broaden our perspective on data mining. Reading computer science papers is a primary way to keep up with the state of the art once you graduate. You will learn approaches to reading a computer science paper that will increase your understanding of it.

Course Information

    • Mining of Massive Datasets, 2nd ed. by Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman. (ISBN: 978-1107077232), Cambridge University Press, 2014.

This book is available for free from http://www.mmds.org/.

    • Data-Intensive Text Processing with MapReduce, by Jimmy Lin, Chris Dyer, and Graeme Hirst. (ISBN: 978-1608453429), Morgan and Claypool, 2010.

An earlier but acceptable version of this book is available for free from: http://lintool.github.io/MapReduceAlgorithms/index.html.

This book is also available for free online if you are an ACM Member. ACM is the professional association for computer science.  ACM Membership costs students $19 per year and gives you access to many useful computer science books for free.

  • A Partial List of Required Articles:
    • Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Computer Networks and ISDN Systems, Volume 30, Number 1, 1998, Pages 107-117, Elsevier.  Available from: http://infolab.stanford.edu/~backrub/google.html.
    • Chris Anderson, The Long Tail. In Wired, October, 2004.  Available from: https://www.wired.com/2004/10/tail/.
    • Julia Stoyanovich, Bill Howe, Serge Abiteboul, Gerome Miklau, Arnaud Sahuguet, and Gerhard Weikum. Fides: Towards a Platform for Responsible Data Science. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM 2017). Article 26, 6 pages, ACM. Available from: https://doi-org.ezproxy.lafayette.edu/10.1145/3085504.3085530 
  • Prerequisites:

Students are required to complete CS 202 (Analysis of Algorithms) successfully before registering for this course. Students are strongly encouraged to have completed, or to co-register in, Math 272 (Linear Algebra with Applications) or Math 300 (Vector Spaces).

Course Objectives and Learning Outcomes

Course Objectives:

The goals of this course are to convey:

    • Skill in using a parallel framework to process massive datasets.
    • Understanding of important data mining concepts and techniques.
    • Ability to analyze the time/space tradeoffs when working with big data.
    • Ability to select the proper algorithm for a specific problem.
    • Ability to evaluate the success of a data mining algorithm.
    • Understanding of the ethical issues that arise in using these algorithms.

Course Outcomes: After successfully completing this course, students will be able to:

    • Apply knowledge of computing and mathematics appropriate to data mining and to the computer science discipline.
    • Analyze a problem, and identify and define the computing requirements appropriate to its solution.
    • Design, implement, and evaluate a computer-based system, process, component, or program to meet desired needs.
    • Function effectively on a team to accomplish a common goal.
    • Communicate effectively with a range of audiences.

Assignment Types

Class Attendance: Class attendance is critical to success.  Every effort should be made to arrive on time and remain attentive for the entire class. In addition to unexcused absence, a pattern of late arrival, early departure or inattentiveness will be considered a violation of the attendance requirement. For every unexcused absence, you will lose 0.5 points from your final grade, up to a maximum of 3 points. If you have a dean’s excuse or a coach’s letter for a particular day, you will be excused from class.

Class Participation: Class participation includes asking or answering questions, expressing an opinion about a topic of discussion, meeting with me during office hours, and reporting on activities in a project or a reading group. All sincere efforts to participate are admired, so don’t worry, just speak up. You are even welcome to express an opinion different from mine. All types of participation count except participation that shows you failed to prepare for class. For example, asking “Who is Ada Lovelace?” when the assignment was to read the Countess of Lovelace’s biography would not count as class participation.

Reading Groups: Before the second class meeting, we will form reading groups of four students. Several times during the semester, you will read a paper in the computer science literature, and meet with your reading group to discuss it. I will provide you with a strategy for reading the paper well, and some questions for discussion. Your group will be responsible for submitting a summary of your discussion in writing before the class meets to discuss the paper with me. You may also want to meet with your reading group to study for exams. Your project group may be different from your reading group.

Quizzes: To help you stay current, there may be unannounced quizzes. A quiz may test your knowledge of the previous week’s material or whether you prepared for class by doing the assigned reading. The intention is to encourage you to review the material each week in preparation for the next week and to do the assigned reading before class.  Quizzes will be counted as part of your homework grade.

Homework: Homework assignments will consist of problems and labs in the Gradiance System, and other problems/labs assigned in class. Assignments will be listed in Moodle to help all of us track them. Completing the homework will deepen your understanding, help you build skills necessary for completing the projects, and assist you in preparing for exams. Your homework should be your own work, not copied from or supplied by anyone else. Since we will often discuss homework in class after the due date, late homework will not be accepted. For each Gradiance homework, you can read the assistance and redo the exercise if you make a mistake.

Projects: There will be one group project and two individual projects.  The group project will be done in teams of 4 students, and will involve online advertising and search engine optimization. The individual projects require designing and implementing an application of the algorithms studied, and evaluating the application created. Groups and individuals will present their projects to the class.

Exams: We will have one written midterm in class, and a final.  There will be no make-up or early exam sittings without a request from a dean or coach on your behalf.  Before any exam begins, you are required to close your course materials, and put them and your phone at the front of the classroom to avoid the temptation to look at them during the exam.

Academic Honesty

It is essential that you follow the Lafayette College Code of Conduct with respect to academic honesty and avoidance of plagiarism as described in the Student Handbook. The beginning of the semester is a good time to review the handbook in this regard.

“To maintain the scholarly standards of the College and, equally important, the personal ethical standards of our students, it is essential that written assignments be a student’s own work, just as is expected in examinations and class participation. A student who commits academic dishonesty is subject to a range of penalties, including suspension or expulsion. Finally, the underlying principle is one of intellectual honesty. If a person is to have self-respect and the respect of others, all work must be his/her own.”

The Handbook gives the following examples of intellectual dishonesty:

    1. Submitting someone else’s work as your own.
    2. Incorporating someone else’s ideas or work into your own without attribution.
    3. Paying or arranging for someone else to do your work.
    4. Re-using material from another course without permission of your instructor.
    5. Engaging in unauthorized collaboration including asking for homework or programming assignment answers from an online discussion group.
    6. Obtaining the Instructor’s Answer Guide to the exercises in the book and using it. (This would also constitute theft, since the guide is only licensed to instructors.)

Academic dishonesty also includes copying explanatory material, such as descriptions of software packages, into documentation without indicating the source of the text and that it was copied.

When in doubt about whether an action is considered academic dishonesty, it is best to consult with me before you act. Cases of suspected intellectual dishonesty will be reported to the Dean, and the Dean will investigate and impose penalties.

Grading

Graded Material:
The course grade is based on the materials listed below, graded on a 100-point scale, with each item contributing a specified percentage to the overall score. As specified in the student handbook, A will reflect excellent work, B will reflect good work, C will reflect acceptable work, and D will reflect passing work.

Class Attendance & Participation 5%
Quizzes and Homework 5%
Group Project 20%
Individual Projects 25%
Mid-Term 20%
Final Exam 25%

Grading Scale:
Typically, grades are assigned as follows from the student’s numerical grade:

A:  93-100    B+: 87-89    C+: 77-79    D+: 67-69    F: 0-59
A-: 90-92     B:  83-86    C:  73-76    D:  63-66
              B-: 80-82    C-: 70-72    D-: 60-62
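
As a rough illustration of how the percentages and the typical scale above combine, here is a minimal Python sketch. The function and key names are hypothetical and chosen only for this example. It assumes that each item is scored from 0 to 100, that the attendance penalty is subtracted from the weighted total on the 100-point scale, and that the total is rounded to the nearest whole number before the scale is applied. None of these details is stated in the syllabus, and actual grades may be assigned differently.

```python
# Weights in percent, copied from the "Graded Material" list.
# The key names are hypothetical labels for this sketch.
WEIGHTS = {
    "attendance_participation": 5,
    "quizzes_homework": 5,
    "group_project": 20,
    "projects": 25,
    "midterm": 20,
    "final_exam": 25,
}

# Lower bound of each letter band from the typical grading scale, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def attendance_penalty(unexcused_absences):
    """0.5 points per unexcused absence, capped at 3 points."""
    return min(0.5 * unexcused_absences, 3.0)


def numerical_grade(scores, unexcused_absences=0):
    """Weighted total on a 100-point scale, minus the attendance penalty.

    Assumption: every item in `scores` is on a 0-100 scale.
    """
    weighted = sum(WEIGHTS[item] * scores[item] for item in WEIGHTS) / 100
    return max(weighted - attendance_penalty(unexcused_absences), 0.0)


def letter_grade(points):
    """Map a numerical grade to a letter using the typical scale.

    Assumption: fractional grades are rounded half up to a whole number.
    """
    rounded = int(points + 0.5)
    for lower_bound, letter in CUTOFFS:
        if rounded >= lower_bound:
            return letter
    return "F"


if __name__ == "__main__":
    example_scores = {item: 90 for item in WEIGHTS}
    total = numerical_grade(example_scores, unexcused_absences=2)
    print(total, letter_grade(total))
```

Because the syllabus says grades are "typically" assigned this way, treat the result as an estimate rather than a guarantee.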

Tentative Schedule*

Week | Topic | Reading
1 | Overview of Big Data and Its Applications | Lin, Dyer, Hirst, Chapter 1; Leskovec, et al., Chapter 1
2 | MapReduce and the Big Data Software Stack | Lin, Dyer, Hirst, Chapter 2; Leskovec, et al., Chapter 2
3 | MapReduce Algorithm Design | Lin, Dyer, Hirst, Chapter 3
4 | Finding Similar Items | Leskovec, et al., Chapter 3
5 | Web Crawling and Indexing | Lin, Dyer, Hirst, Chapter 4
6 | Web Link Analysis, Search Engines | Lin, Dyer, Hirst, Chapter 5; Leskovec, et al., Chapter 5 through Section 5.3; Brin and Page, The anatomy of a large-scale hypertextual Web search engine.
7 | Search Engine Optimization and Gaming Search Engines | Leskovec, et al., Sections 5.4-5.6
8 | Online Advertising, Adwords | Leskovec, et al., Chapter 8
9 | Group Project Presentations |
10 | Frequent Item Sets and Brick and Mortar Marketing | Leskovec, et al., Chapter 6
11 | Clustering | Leskovec, et al., Chapter 7
12 | Recommendation Systems | Leskovec, et al., Chapter 9; Anderson, The Long Tail.
13 | Recommendation Systems |
14 | Fairness, Accountability and Transparency in Data Science Algorithms | Julia Stoyanovich, Bill Howe, Serge Abiteboul, Gerome Miklau, Arnaud Sahuguet, and Gerhard Weikum. Fides: Towards a Platform for Responsible Data Science.
15 | Final Project Presentations |

 

*Tentative schedule, subject to change. Check Moodle for the most up-to-date information.

**No travel arrangements should be made until the final exam schedule has been issued.

Additional Information

SMS Advisor: Sexual Misconduct Support Advisors, formerly SASH, is a campus organization that seeks to prevent sexual and gender based violence and harassment. It also provides services to those who have suffered from such violence or harassment. I am an SMS Advisor, and will be on call 24/7 for one week this semester. During that week, in the unlikely event that I receive a call during class, you will continue with your lab or assignment while I assist the caller. If you or a friend ever needs an advisor, call or text 484 548-0325.

Respect for classmates, colleagues and team members: All students are expected to show respect and courtesy to each other. Mutual respect is a high ideal in academic, business, and personal life. It is central to learning and working well together. Disagreements over ideas and constructive criticism of someone’s work are in keeping with this ideal. Attacking or disparaging someone is not, and will not be tolerated. In group projects, mutual respect also includes reliably contributing to the project and keeping your commitments to the group.

We follow the College Diversity Statement which says in part:

All members of the College community share a responsibility for creating, maintaining, and developing a learning environment in which difference is valued, equity is sought, and inclusiveness is practiced.

To learn more about how these principles are followed in the computing industry, view the Google video:

Diversity at Google (https://youtu.be/_3RoQRN65AI)

and the eBay video:

Diversity Workshop at eBay, Europe (https://player.vimeo.com/video/159767606)

Privacy: Moodle contains student information that is protected by the Family Educational Rights and Privacy Act (FERPA). Disclosure to unauthorized parties violates federal privacy laws. Courses using Moodle will make student information visible to other students in this class. Please remember that this information is protected by these federal privacy laws and must not be shared with anyone outside the class. Questions can be referred to the Registrar’s Office.

Equal Access: In compliance with Lafayette College policy and equal access laws, I am available to discuss appropriate academic accommodations that you may require as a student with a disability. Requests for academic accommodations need to be made during the first two weeks of the semester, except for unusual circumstances, so arrangements can be made. Students must register with the Office of the Dean of the College for disability verification and for determination of reasonable academic accommodations.

Important Dates:

  • Normal Add/Drop deadline: February 8th
  • Spring Break: March 18th – 22nd
  • Last day to Withdraw (WD): April 22nd
  • Classes end: May 10th
  • Final Exams: May 13th – 20th

Federal credit hour statement: The student work in this course is in full compliance with the federal definition of a four-credit-hour course. Please see the Registrar’s Office Website (https://registrar.lafayette.edu/wp-content/uploads/sites/193/2013/04/Federal-Credit-Hour-Policy-Web-Statement.doc) for the full policy and practice statement.

CS 417: Data Mining
Copyright © 2019, 2018 Joann J. Ordille.