Spring 2024 ORIE 4741/5741 Learning with Big Messy Data


Course Website: https://canvas.cornell.edu/courses/62820

  • Lecture time: 2:55-4:10 pm every Tuesday & Thursday, Gates Hall G01
  • Discussion sections: Monday 12:25-1:15 pm, Tuesday 10:10-11:00 am, Wednesday 9:05-9:55 am and 10:10-11:00 am. (No discussion sections in the first week)

Information

Modern data sets, whether collected by scientists, engineers, physicians, bureaucrats, financiers, or tech billionaires, are often big, messy, and extremely useful. This course addresses scalable, robust methods for learning from big messy data. We will cover techniques for learning with data that is messy (consisting of measurements that are continuous, discrete, Boolean, categorical, or ordinal, or of more complex data such as graphs, texts, or sets, possibly with missing entries and outliers) and big (meaning we can only use algorithms whose complexity scales linearly in the size of the data). We will cover techniques for cleaning data, supervised and unsupervised learning, finding similar items, model validation, and feature engineering. The course culminates in a final project in which students extract useful information from a big messy data set.

Prerequisites

Familiarity with linear algebra and matrix notation, a modern scripting language (such as Python, Matlab, Julia, or R), and basic complexity analysis and big-\(O\) notation. More formally, we strongly recommend:

  • Linear Algebra (MATH 2940 or equivalent). Important topics: inner products, matrix multiplication, singular value decomposition.

  • Probability (ENGRD 2700 or equivalent). Important topics: random sampling, maximum likelihood estimation.

  • Programming (ENGRD/CS 2110 or equivalent). Important topics: basic comfort in a scripting language, iteration, functions.
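For a concrete sense of the expected background, an incoming student should be able to read a short script like the following and anticipate what it does. (This is an illustrative sketch using numpy; the particular matrix is made up.)

```python
import numpy as np

# A small data matrix: 4 samples, 3 features
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# Inner product of the first two rows
dot = X[0] @ X[1]  # 1*0 + 2*1 + 0*1 = 2.0

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values come back nonnegative, in decreasing order
assert np.all(s[:-1] >= s[1:])

# The factorization reconstructs X (up to floating-point error)
assert np.allclose(U @ np.diag(s) @ Vt, X)
```

If the inner product, the shapes of the SVD factors, and the reconstruction check all feel routine, the linear algebra and programming prerequisites are likely satisfied.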

Syllabus

  • Exploratory data analysis
  • The perceptron algorithm, Support Vector Machine (SVM)
  • Linear regression
  • Feature engineering
  • Train/test/validation splits, generalization
  • Singular value decomposition (SVD), quadratic regularization
  • Bootstrap and the bias-variance tradeoff
  • Trees and forests
  • Various loss functions and regularization
  • Clustering and K-means
  • Contemporary topics in machine learning
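To give a flavor of the early classification material, here is a minimal sketch of the classic perceptron update rule on a toy linearly separable data set. (This is illustrative only, not the course's reference implementation; the function name and data are made up.)

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """Minimal perceptron: X is (n, d), labels y are in {-1, +1}.
    Returns weights w and bias b for the classifier sign(X @ w + b)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the current point is misclassified
            if yi * (xi @ w + b) <= 0:
                w += yi * xi
                b += yi
    return w, b

# Toy linearly separable data: positives above the line x1 + x2 = 0
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
preds = np.sign(X @ w + b)
assert np.array_equal(preds, y)  # all points classified correctly
```

On separable data like this, the perceptron is guaranteed to converge in finitely many updates; the course will make that claim precise and contrast it with the SVM, which chooses a maximum-margin separator rather than just any separator.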

Textbooks and readings

We will not require students to purchase any textbook; the information you need to know will be posted as lecture slides or notes. However, we heartily recommend all of the following, and will be drawing on ideas from many of these:

  • Learning from Data, Abu-Mostafa, Magdon-Ismail, and Lin (cheap, and strongly recommended)
  • Foundations of Machine Learning, Mohri, Rostamizadeh, and Talwalkar
  • Introduction to Statistical Learning, James, Witten, Hastie, and Tibshirani (free online)
  • Feature Engineering and Selection, Kuhn and Johnson (free online)
  • Understanding Machine Learning, Shalev-Shwartz and Ben-David (free online)
  • Mining of Massive Datasets, Leskovec, Rajaraman, and Ullman (free online)
  • Foundations of Data Science, Hopcroft and Kannan (free online)
  • Artificial Intelligence: A Modern Approach, Russell and Norvig. http://aima.cs.berkeley.edu/
  • Pattern Recognition and Machine Learning, Bishop
  • Python Tutorial 2021, Derek Banas. https://www.youtube.com/watch?v=H1elmMBnykA
  • Python Machine Learning, Raschka. https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130
  • Linear Algebra and Learning from Data, Gilbert Strang. https://math.mit.edu/~gs/