Spring 2023 | Haiyun He

`Spring 2023` ORIE 4741/5741 Learning with Big Messy Data

Course Website: https://canvas.cornell.edu/courses/51457

Lecture time: 2:45-4:00 pm every Tuesday & Thursday
Discussion sections: Monday 12:25-1:15 pm, Tuesday 10:10-11:00 pm, Wednesday 9:05-9:55 and 10:10-11:00. There are no discussion sections in the first week.

Information

Modern data sets, whether collected by scientists, engineers, physicians, bureaucrats, financiers, or tech billionaires, are often big, messy, and extremely useful. This course addresses scalable, robust methods for learning from big messy data. We will cover techniques for learning with data that is messy (consisting of measurements that are continuous, discrete, boolean, categorical, or ordinal, or of more complex data such as graphs, texts, or sets, with missing entries and with outliers) and that is big (which means we can only use algorithms whose complexity scales linearly in the size of the data). We will cover techniques for cleaning data, supervised and unsupervised learning, finding similar items, model validation, and feature engineering. The course will culminate in a final project in which students extract useful information from a big messy data set.
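As a rough illustration of what "messy" means here, the sketch below builds a tiny table that mixes continuous, boolean, categorical, and ordinal fields, with missing entries and one outlier, and summarizes it in a single pass so the work stays linear in the number of entries. The records, the field names, and the `summarize` helper are all hypothetical and are not taken from the course materials.

```python
# Hypothetical toy records; every field name and value is made up for illustration.
records = [
    {"age": 34, "smoker": True, "plan": "basic", "rating": 3},
    {"age": None, "smoker": False, "plan": "premium", "rating": 5},
    {"age": 29, "smoker": None, "plan": "basic", "rating": None},
    {"age": 410, "smoker": False, "plan": None, "rating": 4},  # 410 looks like a data-entry outlier
]


def summarize(rows):
    """Single pass over the rows: count missing entries per field and average the observed ages."""
    missing = {}
    total_age = 0.0
    observed_ages = 0
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] = missing.get(field, 0) + 1
        if row["age"] is not None:
            total_age += row["age"]
            observed_ages += 1
    mean_age = total_age / observed_ages if observed_ages else None
    return missing, mean_age


missing, mean_age = summarize(records)
print(missing)   # number of missing entries in each field
print(mean_age)  # the single outlier pulls the mean far above the typical age
```

Each entry is touched once, so the cost grows linearly with the size of the table. The inflated mean shows why outliers and missing entries have to be handled before fitting a model.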

Prerequisites

Familiarity with linear algebra and matrix notation, a modern scripting language (such as Python, MATLAB, Julia, or R), and basic complexity and \(O(n)\) notation. More formally, we strongly recommend:

Linear Algebra (MATH 2940 or equivalent). Important topics: inner products, matrix multiplication, singular value decomposition.
Probability (ENGRD 2700 or equivalent). Important topics: random sampling, maximum likelihood estimation.
Programming (ENGRD/CS 2110 or equivalent). Important topics: basic comfort in a scripting language, iteration, functions.

Syllabus

Exploratory data analysis
The perceptron algorithm, Support Vector Machine (SVM)
Linear regression
Feature engineering
Train, test, validate, generalization
Singular value decomposition (SVD), quadratic regularization
Bootstrap and bias-variance tradeoff
Trees and forests
Various loss functions and regularization
Clustering and K-means
Contemporary topics in machine learning

Textbooks and readings

We will not require students to purchase any textbook; the information you need to know will be posted as lecture slides or notes. However, we heartily recommend all of the following, and will be drawing on ideas from many of these:

Learning from Data, Abu-Mostafa, Magdon-Ismail, and Lin (cheap, and strongly recommended)
Foundations of Machine Learning, Mohri, Rostamizadeh, and Talwalkar
(free online) Introduction to Statistical Learning, James, Witten, Hastie, and Tibshirani
(free online) Feature Engineering and Selection, Max Kuhn and Kjell Johnson
(free online) Understanding Machine Learning, Shalev-Shwartz and Ben-David
(free online) Mining of Massive Datasets, Leskovec, Rajaraman, and Ullman
(free online) Foundations of Data Science, Hopcroft and Kannan
Artificial Intelligence, Russell and Norvig. http://aima.cs.berkeley.edu/
Pattern Recognition and Machine Learning, Bishop
Python Tutorial 2021, Derek Banas. https://www.youtube.com/watch?v=H1elmMBnykA
Python Machine Learning, Raschka. https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130
Linear Algebra and Learning from Data Open Course, Gilbert Strang. https://math.mit.edu/~gs/
