CS 417: Data Mining

Spring 2021

Welcome to Data Mining!

All the homework/labs/lectures will be distributed on Moodle.

We meet MWF, 10-10:50 am via the video conferencing link on Moodle.

Course Description

This course covers the challenges and techniques of utilizing big data. Through the course, you will learn frameworks such as MapReduce and Spark for parallel processing of massive datasets. You will study algorithms in data mining, and their applications including search engines, online advertising, and recommendation systems. You will explore the social and ethical implications of using these algorithms.

Data mining algorithms can be applied to large datasets to unearth new knowledge. This knowledge might predict which web page will answer a user’s question, which products should be marketed together in a store, how to use online stores to generate demand for unknown products, or the most satisfying tour for a museum visitor. Data mining algorithms are foundational knowledge for today’s data science professions. This course will help you identify the right algorithm for a particular application.

Using massive datasets, collected from perhaps unknowing users, raises ethical issues that have become a center of attention recently due to a series of inappropriate uses. During the course, we will examine ethical issues in data mining in the context of recent incidents and the evolving ethical standards in the field.

Instructor

  • Instructor: Dr. Joann J. Ordille
  • Email: ordillej@lafayette.edu
  • Office: 565 RISC
  • Office Phone: (610) 330-5416
  • Office Hours: MW: 3-5, F: 3-4 online on class link. Please email me if you would like to meet at another time.

Resources

Gradiance is our online learning system. Gradiance provides exercises with guidance when you choose an incorrect answer. Create your account using the code I emailed you.

CATME is the class tool for creating teams and evaluating the contribution of team members. This is a general information site. You will receive email for team creation and evaluation.

  • AWS Educate

I will send you information about joining AWS Educate with Lafayette’s Institutional Membership, which will provide you with free resources for some course projects and for independent study. We will primarily use AWS Educate to explore the Apache Hadoop Ecosystem, in particular MapReduce.

Apache Zeppelin is installed in multi-user mode on our local server for this course. You can also download it and install it on your laptop. It provides a data science workbench that you may find useful in your independent project. Zeppelin provides Spark as an alternative to MapReduce.

  • Moodle

This course uses Moodle extensively. In particular, detailed information for class activities and assignments will be available weekly. Please check Moodle frequently, and whenever you receive an email notification to do so.

  • Papers published in the computer science literature as announced in this syllabus or on Moodle.

We will read two or more papers together to broaden our perspective on data mining. Reading computer science papers is a primary way to keep up with the state of the art once you graduate. You will learn approaches to reading a computer science paper that will increase your understanding of it.

Course Information

    • Mining of Massive Datasets, 3rd ed. by Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman. (ISBN: 978-1108476348), Cambridge University Press, 2020.

This book is available for free from http://www.mmds.org/.

    • Data Intensive Text Processing with MapReduce, by Jimmy Lin, Chris Dyer, and Graeme Hirst. (ISBN: 978-1608453429), Morgan and Claypool, 2010.

An earlier but acceptable version of this book is available for free from: http://lintool.github.io/MapReduceAlgorithms/index.html.

This book is also available for free online if you are an ACM Member. ACM is the professional association for computer science. ACM Membership costs students $19 per year and gives you access to many useful computer science books for free.

  • Required articles will be posted on Moodle.
  • Prerequisites:

Students are required to complete CS 202 (Analysis of Algorithms) successfully before registering for this course. Students are strongly encouraged to have completed or to co-register in Math 272 (Linear Algebra with Applications) or in Math 300 (Vector Spaces).

Course Objectives and Learning Outcomes

Course Objectives:

The goals of this course are to convey:

    • Skill in using a parallel framework to process massive datasets.
    • Understanding of important data mining concepts and techniques.
    • Ability to analyze the time/space tradeoffs when working with big data.
    • Ability to select the proper algorithm for a specific problem.
    • Ability to evaluate the success of a data mining algorithm.
    • Understanding of the ethical issues that arise in using these algorithms.

Course Outcomes: After successfully completing this course, students will be able to:

    • Apply knowledge of computing and mathematics appropriate to data mining and to the computer science discipline.
    • Analyze a problem, and identify and define the computing requirements appropriate to its solution.
    • Design, implement, and evaluate a computer-based system, process, component, or program to meet desired needs.
    • Function effectively on a team to accomplish a common goal.
    • Communicate effectively with a range of audiences.

Assignment Types

Class Participation: Class participation includes asking or answering questions, expressing an opinion about a topic of discussion, meeting with me during office hours, interacting with me on Slack, participating in your team, and reporting on activities in a team project or other activity. All sincere efforts to participate are admired, so don’t worry, just speak up. You are even welcome to express an opinion different from mine. All types of participation count except participation that shows you failed to prepare for class. For example, asking “Who is Ada Lovelace?” when the assignment was to read the Countess of Lovelace’s biography would not count as class participation. But it’s always better to ask than to sit there in the dark.

Lightweight Team Participation: Research has shown that students learn better in a community with their peers. We hope to help you form that community by creating lightweight teams that will collaborate in various course activities. The teams are lightweight, because they are for learning collaboratively without a lot of grade stress. Your contribution to your team counts for 2% of your grade.  Your lightweight team will also function as your reading group.  Several times during the semester, you will read a paper in the computer science literature, and meet with your lightweight team to discuss it. I will provide you with a strategy for reading the paper well, and some questions for discussion. Your group will be responsible for submitting a summary of your discussion in writing before the class meets to discuss the paper with me. You may also want to meet with your lightweight team to study for exams. Your software project group may be different from your lightweight team.

Quizzes: To help you stay current, there may be unannounced quizzes. The quiz may test your knowledge of the previous week or whether you prepared for class by doing the assigned reading. The intention here is to encourage you to review the material each week in preparation for the next week and to do assigned reading before class. Quizzes will be counted as part of your homework grade.

Homework: Homework assignments will consist of problems and labs in the Gradiance System, and other problems/labs assigned in class. Assignments will be listed in Moodle to help all of us track them. Completing the homework will deepen your understanding, help you build skills necessary for completing the projects, and assist you in preparing for exams. Your homework should be your own work, and not copied or supplied by anyone else. Since we will often discuss homework in class after the due date, late homework will not be accepted. For each Gradiance homework, you can read the assistance and redo the exercise if you make a mistake.

Projects:  The projects require designing and implementing an application of the algorithms studied, and evaluating the application created. Each student will be required to present one of their projects to the class.

Exams: We will have one written midterm in class, and a final. All work on the exam must be your own. There will be no make-up or early exam sittings without a request from the Dean or a Coach on your behalf.

Academic Honesty

It is essential that you follow the Lafayette College Code of Conduct with respect to academic honesty and avoidance of plagiarism as described in the Student Handbook. The beginning of the semester is a good time to review the handbook in this regard.

“To maintain the scholarly standards of the College and, equally important, the personal ethical standards of our students, it is essential that written assignments be a student’s own work, just as is expected in examinations and class participation. A student who commits academic dishonesty is subject to a range of penalties, including suspension or expulsion. Finally, the underlying principle is one of intellectual honesty. If a person is to have self-respect and the respect of others, all work must be his/her own.”

The Handbook gives the following examples of intellectual dishonesty:

    1. Submitting someone else’s work as your own.
    2. Incorporating someone else’s ideas or work into your own without attribution.
    3. Paying or arranging for someone else to do your work.
    4. Re-using material from another course without permission of your instructor.
    5. Engaging in unauthorized collaboration including asking for homework or programming assignment answers from an online discussion group.
    6. Obtaining the Instructor’s Answer Guide to the exercises in the book and using it. (This would also constitute theft, since the guide is only licensed to instructors.)

Academic dishonesty also includes copying explanatory material, such as descriptions of software packages, into documentation without indicating the source of the text and that it was copied.

When in doubt about whether an action is considered academic dishonesty, it is best to consult with me before you act. Cases of suspected intellectual dishonesty will be reported to the Dean, and the Dean will investigate and impose penalties.

Grading

Graded Material:
The course grade is based on the materials listed below, graded on a 100-point scale, with each item contributing a specified percentage to the overall score. As specified in the student handbook, A will reflect excellent work, B will reflect good work, C will reflect acceptable work, and D will reflect passing work.

Class Attendance & Participation 2%
Lightweight Team Participation 2%
Quizzes and Homework 7%
Projects 20%
First Midterm 23%
Second Midterm 23%
Final Exam 23%

Grading Scale:
Typically, grades are assigned as follows from the student’s numerical grade:

A: 93-100, A-: 90-92
B+: 87-89, B: 83-86, B-: 80-82
C+: 77-79, C: 73-76, C-: 70-72
D+: 67-69, D: 63-66, D-: 60-62
F: 0-59
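
As a rough illustration of how the percentages and the typical grading scale combine, the following Python sketch computes a weighted course score and looks up a letter. The weights and cutoffs are copied from the lists above; the dictionary keys, the function names, and the treatment of fractional scores (compared against each band's lower bound without rounding) are assumptions made for this example, not course policy.

```python
# Weights (in percent) from the Graded Material list.
# The key names are illustrative, not official labels.
WEIGHTS = {
    "participation": 2,
    "team": 2,
    "quizzes_homework": 7,
    "projects": 20,
    "midterm1": 23,
    "midterm2": 23,
    "final": 23,
}

# Lower bound of each letter band from the typical grading scale.
CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def course_score(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Map a numerical score to a letter.

    Assumption: fractional scores are not rounded before the lookup.
    """
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


# Hypothetical component scores for one student.
example = {
    "participation": 100,
    "team": 100,
    "quizzes_homework": 90,
    "projects": 85,
    "midterm1": 80,
    "midterm2": 88,
    "final": 92,
}
print(course_score(example), letter_grade(course_score(example)))
```

Because the three exams together carry 69% of the weight, a change in exam scores moves the result far more than the same change in homework or participation.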

Tentative Schedule*

Week | Topic | Reading
1 | Overview of Big Data and Its Applications | Lin, Dyer, Hirst, Chapter 1; Leskovec, et al., Chapter 1
2 | MapReduce and the Big Data Software Stack | Lin, Dyer, Hirst, Chapter 2; Leskovec, et al., Chapter 2
3 | MapReduce Algorithm Design, Spark | Lin, Dyer, Hirst, Chapter 3; Zaharia, et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, 2012.
4 | Finding Similar Items | Leskovec, et al., Chapter 3
5 | Mining Data Streams | Leskovec, et al., Chapter 4; Akidau, et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, 2015.
6 | Web Crawling and Indexing | Lin, Dyer, Hirst, Chapter 4
7 | Web Link Analysis, Search Engines | Lin, Dyer, Hirst, Chapter 5; Leskovec, et al., Chapter 5 – 5.3; Brin and Page, The anatomy of a large-scale hypertextual Web search engine, 1998.
8 | Search Engine Optimization and Gaming Search Engines | Leskovec, et al., Chapter 5.4-5.6
9 | Online Advertising, Adwords | Leskovec, et al., Chapter 8
10 | Frequent Item Sets and Brick and Mortar Marketing | Leskovec, et al., Chapter 6
11 | Clustering | Leskovec, et al., Chapter 7
12 | Recommendation Systems | Leskovec, et al., Chapter 9, and Anderson, The Long Tail.
13 | Recommendation Systems |
14 | Fairness, Accountability and Transparency in Data Science Algorithms | Julia Stoyanovich, Bill Howe, Serge Abiteboul, Gerome Miklau, Arnaud Sahuguet, and Gerhard Weikum. Fides: Towards a Platform for Responsible Data Science, 2017.
15 | Project Presentations |

 

*Tentative schedule, subject to change. Check Moodle for the most up-to-date information.

**No travel arrangements should be made until the final exam schedule has been issued.

Additional Information

SM&RT (SMART) Advisor: Advisors in the Sexual Misconduct & Resource Training Program, formerly SASH, seek to prevent sexual and gender-based violence and harassment. They also provide services to those who have suffered from such violence or harassment. I am a SMART Advisor, and am available by appointment to assist you or your friends if you experience sexual misconduct. As a SMART Advisor, I can keep the names of those involved or those consulting me private.

Respect for classmates, colleagues and team members: All students are expected to show respect and courtesy to each other. Mutual respect is a high ideal in academic, business, and personal life. It is central to learning and working well together. Disagreements over ideas and constructive criticism of someone’s work are in keeping with this ideal. Attacking or disparaging someone is not, and will not be tolerated. In group projects, mutual respect also includes reliably contributing to the project and keeping your commitments to the group.

We follow the College Diversity Statement which says in part:

All members of the College community share a responsibility for creating, maintaining, and developing a learning environment in which difference is valued, equity is sought, and inclusiveness is practiced.

To learn more about how these principles are followed in the computing industry, view the Google video:

Diversity at Google (https://youtu.be/_3RoQRN65AI)

and the eBay video:

Diversity Workshop at eBay, Europe (https://player.vimeo.com/video/159767606)

Privacy: Moodle contains student information that is protected by the Family Educational Rights and Privacy Act (FERPA). Disclosure to unauthorized parties violates federal privacy laws. Courses using Moodle will make student information visible to other students in this class. Please remember that this information is protected by these federal privacy laws and must not be shared with anyone outside the class. Questions can be referred to the Registrar’s Office.

Equal Access: In compliance with Lafayette College policy and equal access laws, I am available to discuss appropriate academic accommodations that you may require as a student with a disability. Requests for academic accommodations need to be made during the first two weeks of the semester, except for unusual circumstances, so arrangements can be made. Students must register with the Office of the Dean of the College for disability verification and for determination of reasonable academic accommodations.

Important Dates:

  • Normal Add/Drop deadline: February 19th
  • Spring Break: March 30th – 31st
  • Last day to Withdraw (WD): April 26th
  • Classes end: May 19th
  • Final Exams: May 22nd – 29th

Federal credit hour statement: The student work in this course is in full compliance with the federal definition of a four credit hour course. Please see the Registrar’s Office Website (https://registrar.lafayette.edu/wp-content/uploads/sites/193/2013/04/Federal-Credit-Hour-Policy-Web-Statement.doc) for the full policy and practice statement.

Copyright © 2021, 2019, 2018 Joann J. Ordille.