Data size
intro
Data Validation
- We work with large datasets and often move them around (local <-> cluster/remote), so we need to make sure the data isn't corrupted by network or disk errors. My way to do so is:
- Provide a seed to the random number generator and sample the data.
- Run md5sum on the sample.
- Repeat the above a couple of times with different seeds; if the md5 matches on both sides each time, chances are the data are identical.
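
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not necessarily the exact workflow used: it samples fixed-size byte chunks rather than records, uses `hashlib.md5` as a stand-in for the `md5sum` command, and mixes the file size into the hash (an added assumption, so that truncated copies are caught). The names `sampled_md5` and `sampled_fingerprint`, and the default seeds, sample count, and chunk size, are hypothetical. It also assumes both machines run the same Python 3 version so that a given seed produces the same offsets.

```python
import hashlib
import os
import random


def sampled_md5(path, seed, n_samples=100, chunk_size=4096):
    """MD5 hex digest over n_samples pseudo-randomly chosen chunks of a file."""
    size = os.path.getsize(path)
    rng = random.Random(seed)  # same seed -> same offsets on every machine
    digest = hashlib.md5()
    # Assumption: include the file size so a truncated copy never matches.
    digest.update(str(size).encode())
    max_offset = max(size - chunk_size, 0)
    with open(path, "rb") as f:
        for _ in range(n_samples):
            f.seek(rng.randint(0, max_offset))
            digest.update(f.read(chunk_size))
    return digest.hexdigest()


def sampled_fingerprint(path, seeds=(0, 1, 2), **kwargs):
    """One sampled digest per seed; compare this list across machines."""
    return [sampled_md5(path, seed, **kwargs) for seed in seeds]
```

Run `sampled_fingerprint` on the local copy and on the remote copy with the same seeds and compare the resulting lists. Because only a sample is hashed, a small localized corruption can be missed; more seeds or more samples lower that chance, and a full `md5sum` of the file remains the exhaustive check.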
Data preprocessing
Dimension reduction
- Principal Component Analysis:
- Linear Discriminant Analysis: http://sebastianraschka.com/Articles/2014_python_lda.html
- Feature selection: http://sebastianraschka.com/Articles/2014_sequential_sel_algos.html