# Curriculum

### Core courses:

**Probability and Statistics for data analysis (6 units)** - **Syllabus**

Basic principles of probability. Basic theorems in probability, e.g. the law of large numbers and the central limit theorem. Common probability distributions. Principles of statistics. Data summarization. Statistical inference and causality. Experimental design and sampling methods. Estimation and hypothesis testing. Bootstrap and variants.

**Practical Data Science (6 units) - Syllabus**

The course gives students a set of practical skills for handling data that comes in a variety of formats and sizes, such as text, spatial, and time series data. These skills cover the data analysis lifecycle from initial access and acquisition, modeling, transformation, integration, querying, application of statistical learning and data mining methods, and presentation of results. The course is hands-on and uses Python in the IPython interactive computing framework.

**Large Scale Data Management (6 units)**

Methods and techniques for database design and management, operational data management and transaction processing, data warehouse creation, and information retrieval. New approaches for storage and querying (column stores, NewSQL) will be discussed and experimented upon. Management of large scale structured and unstructured data in different information systems environments.

**Machine Learning and Computational Statistics (7 units) - Syllabus**

Introduction to the basic ideas of statistical learning models (supervised and unsupervised learning). Model selection, feature selection and cross-validation. Linear regression and logistic regression. Generalized linear models. K-nearest neighbor classification, Bayes and naive Bayes classifiers. Kernel Discriminant Analysis and Support Vector Machines. Unsupervised learning methods. Clustering using k-means and mixture models. The EM algorithm. Dimensionality reduction using PCA, probabilistic PCA, factor analysis and independent component analysis.

**Numerical optimization and Large Scale Linear Algebra (6 units) - Syllabus**

Floating point arithmetic; Stability of numerical algorithms; Norms; Fundamentals of matrix theory; Solution of systems of linear equations: direct methods, error analysis, structured matrices; Iterative methods for linear equations and least squares; Eigenanalysis; important matrix factorizations and their algorithms. Application to case studies.

**Data visualization and communication (6 units)** - **Syllabus**

Communicating clearly and effectively about the patterns we find in data is a key skill for a successful data scientist. Visualizations are graphical depictions that can improve comprehension. Visualizations will be paired with verbal analyses and reporting. Different tools will be used to transform data and create visualizations, including Python, Google Charts, Tableau, and Spotfire. Assignments will give students experience with reporting on complex patterns and results with graphics and prose.

**Legal, ethical and policy issues in data science (3 units)** - **Syllabus**

Discusses issues of privacy, surveillance, security, classification, discrimination and decisional autonomy from a legal, ethical, and policy perspective (whether business or public policy). Areas of relevance include health, marketing, employment, law enforcement, and education.

### Electives (indicative list):

**Data mining (6 units) - Syllabus**

Data-oriented techniques for extracting patterns from data. Association rules, decision trees. Collaborative filtering and recommendation algorithms. Finding similar items and frequent itemsets. Mining data streams. Mining social network graphs. Mining for Web advertising. Implementing machine learning schemes.

**Bayesian Statistics and simulation methods (6 units)** - **Syllabus**

Bayesian inference. Simulation and random number generation. Markov models and hidden Markov models. Probabilistic graphical models. Bayesian statistical methods, Markov chain Monte Carlo, Metropolis-Hastings algorithm, Gibbs sampling, sequential Monte Carlo methods, approximate Bayesian computation.

**Advanced Large Scale Data Management (5 units)**

Distributed and parallel data-oriented computation and transaction processing. Integration and management of large scale structured and unstructured data in different information systems environments.

**Big Data Systems and techniques (6 units) - Syllabus**

Techniques and best practices for the development of production Big Data systems using Parquet and ORC columnar storage files in Hadoop and the Apache Spark data processing framework with SQL query engines (Spark SQL, Presto). Integration with the latest parallel machine learning frameworks. Cloud service technologies like Amazon EMR. Data visualisation technologies. Streaming and real-time processing with Apache Storm and Kafka.

**Statistics for Big data (3 units) - Syllabus**

Small n, large p problems, regularization, model and variable selection techniques, LASSO, elastic net. Multiplicity. Graphical models. Techniques for sparse matrices and graphical LASSO. Compressed sensing.

**Time series and Forecasting methods (3 units)** - **Syllabus**

Basic principles, autocorrelation and autocovariance, Holt-Winters method, AR, ARMA, ARIMA models. Regression models, ARCH and GARCH, volatility models.

**Optimization (5 units)** - **Syllabus**

Linear programming (formulations and algorithms), convex optimization and applications to machine learning (least squares, linear regression, gradient descent, support vector machines), combinatorial optimization (integer programming formulations, branch and bound), local search methods (hill climbing, tabu search, simulated annealing), genetic algorithms.

**Text analytics (6 units) - Syllabus**

Language models, text normalization. Applying feature extraction, classification, and sequence labeling algorithms (e.g., PCA, naive Bayes, logistic regression, SVMs, HMMs, CRFs) to texts (for document classification, entity recognition etc.). Parsing (CKY, Earley, probabilistic CFGs). Semantics (logic-based, distributional, word embeddings, sense disambiguation) and discourse analysis (co-reference, rhetorical relations). Machine translation. Information extraction (including relation extraction) and sentiment analysis. Question answering. Text summarization. Concept-to-text generation. Speech recognition fundamentals.

**Data science and optimization for operations management (5 units)**

Overview of basic concepts from operations management: process analysis, queues, inventory management, revenue management. Demand forecasting. Inventory/replenishment optimization. Lead time analysis. MRP/production planning. Fleet allocation. Route optimization.

**Marketing and sales analytics (6 units)**

Overview of data mining techniques: clustering, classification, dimensionality reduction, sequence modeling. Techniques for customer segmentation. Churn management. Cross-/up-sell campaign targeting. Next best action. Marketing mix optimization. Omni-channel optimization. Loyalty analytics. Basket analysis.

**Data Science for medicine (3 units)** - **Syllabus**

Introduction to epidemiological methods: bias, confounding, sample size. Survival analysis: hazard functions, parameter inference. Methods for categorical data. Analysis of contingency tables, risk assessment in retrospective and prospective studies.

**Data Science for Biology (3 units)** - **Syllabus**

**Information retrieval (3 units) - Syllabus**

Text vocabulary, automatic indexing, inverted files, fast inversion algorithm, index compression. Evaluation of information retrieval systems. Information retrieval models (Boolean model, vector space model, probabilistic retrieval model), latent semantic indexing. Computing scores, result ranking. Crawling. Link analysis. Search engine architecture and systems issues.

**Data curation (3 units)**

Data lifecycle and value chains. Data provenance, curation and preservation: models, practices and tools. Using ontologies and metadata. Data and metadata aggregators and repositories.

**Advanced Econometric Models for Finance (3 units) - Syllabus**

Introduction to the theory and empirical analysis of advanced econometric models for financial applications. Optimal portfolio construction, performance evaluation and forecasting of financial time series. Multivariate multifactor models. Multivariate heteroskedastic models. Examples applying these advanced econometric models and techniques to actual financial data using R.

**Data Science Challenge (5 units)**

This course aims to familiarize students with the integrated workflow of a Data Science (DS) problem. It begins with an introduction to DS methods, including data preprocessing, feature selection and engineering, machine learning, graph and text mining, and visualization. Next comes an introduction to the specific data challenge and its domain specificities. Students will have sufficient time to work on the challenge and submit their solutions to a platform (such as Kaggle) that enables automated evaluation of predictions for unclassified data. At the end, the best solutions will be presented to the class.

**Introduction to Quantitative Finance and Financial Risk Management (5 units)** - **Syllabus**

**Online Analytical Processing and Big Data Warehouses (3 units) - Syllabus**

**Social Network Analysis (3 units) - Syllabus**

**Financial Information Systems (3 units) - Syllabus**

### Preparatory courses:

**Elements of Statistics and Probability - Syllabus**

**Foundations of Computer Science - Syllabus**

**Math for Data Science - Syllabus**