
This README explains how to run FeDis and how to generate the synthetic data used in our paper.

==========================================================
(1) Running FeDis
==========================================================
Simply run the shell script from a Linux/Unix console: 
	./run_fedis.sh

Before running it, the training and testing files must be located in "input/dataset/<dataset>".
For example, 
	input/dataset/rcv1/rcv1.train
	input/dataset/rcv1/rcv1.test
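Since DPSCD reads its training file from HDFS (see `<inPath>` below), the dataset must also be uploaded there. A minimal sketch using the standard `hadoop fs` commands, with the example paths from this README (adjust to your cluster; on Hadoop 1.x releases, `-mkdir` creates parent directories without the `-p` flag):

```shell
# Create the dataset directory in HDFS and upload the local files.
# Paths here are the README's examples, not fixed requirements.
hadoop fs -mkdir -p input/dataset/rcv1
hadoop fs -put rcv1.train rcv1.test input/dataset/rcv1/
```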

The script first removes the temporary directories (local and HDFS) and then runs the DPSCD (= FeDis) code via "hadoop jar". 
DPSCD requires 6 parameters:
	DPSCD <inPath> <outPath> <hdfsPath> <nreducer> <hditr> <C>
	<inPath>   : input training file in HDFS (e.g., "input/dataset/rcv1/rcv1.train")
	<outPath>  : output directory in HDFS for the Hadoop output files (e.g., "output/dpscd")
	<hdfsPath> : absolute address of your home directory in HDFS (e.g., "hdfs://172.0.0.1:9000/user/fedis/")
	<nreducer> : number of reducers to use
	<hditr>    : number of Hadoop iterations
	<C>        : number of coordinates to update on each machine
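Putting this together, run_fedis.sh roughly amounts to the commands below. This is only a sketch: the jar name and fully qualified class name are assumptions inferred from the GenSynData example later in this README; check run_fedis.sh itself for the exact names in your checkout.

```shell
# Sketch of what run_fedis.sh does: clean up previous output, then launch DPSCD.
# Jar and class names below are assumed, not confirmed.
hadoop fs -rmr output/dpscd          # remove previous HDFS output (Hadoop 1.x syntax)
rm -rf temp                          # remove local temporary directory, if any
hadoop jar DPSCD-0.0.1-SNAPSHOT-r.jar kr.ac.kaist.itcknow.bigml.algo.hadoop.DPSCD \
    input/dataset/rcv1/rcv1.train output/dpscd/ \
    hdfs://172.0.0.1:9000/user/fedis/ 50 100 4
```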

If you run the following command,
	DPSCD input/dataset/rcv1/rcv1.train output/dpscd/ hdfs://172.0.0.1:9000/user/fedis/ 50 100 4
then you can see Hadoop running, with the likelihood and accuracy printed as follows:
	Running.... [FeDis] rcv1        D=47236 C=4
	14/03/03 23:18:20 INFO hadoop.DPSCD: [FeDis][ITR-1][rcv1] D=47236       C=4
	14/03/03 23:18:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
	14/03/03 23:18:20 WARN snappy.LoadSnappy: Snappy native library not loaded
	14/03/03 23:18:20 INFO mapred.FileInputFormat: Total input paths to process : 1
	14/03/03 23:18:21 INFO mapred.JobClient: Running job: job_201402142109_5027
	14/03/03 23:18:22 INFO mapred.JobClient:  map 0% reduce 0%
	14/03/03 23:18:35 INFO mapred.JobClient:  map 100% reduce 0%
	14/03/03 23:18:47 INFO mapred.JobClient:  map 100% reduce 15%
	.....
	Calculating likelihood and accuracy...
	Iteration:1     Time:45 Likelihood:-13948.521358        Accuracy:82.734696
	14/03/03 23:20:29 INFO hadoop.DPSCD: [FeDis][ITR-2][rcv1] D=47236       C=4
	...
You can also run on other datasets by changing <inPath> to news20, kdda, or kddb.
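To track convergence, the per-iteration summary lines can be filtered out of the console output. A sketch, assuming the output was saved to a file (the name "fedis.log" is just an example) and that the lines follow the format shown above:

```shell
# Print iteration, time, likelihood, and accuracy as tab-separated columns,
# parsed from lines like:
#   Iteration:1     Time:45 Likelihood:-13948.521358        Accuracy:82.734696
awk '/Iteration:/ {
  for (i = 1; i <= NF; i++) {
    split($i, kv, ":")
    if (kv[1] == "Iteration" || kv[1] == "Time" ||
        kv[1] == "Likelihood" || kv[1] == "Accuracy")
      printf "%s%s", kv[2], (kv[1] == "Accuracy" ? "\n" : "\t")
  }
}' fedis.log
```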


==========================================================
(2) Generating synthetic data
==========================================================

Simply run the shell script from a Linux/Unix console: 
	./run_syn_gen.sh

The script first removes the temporary directory in HDFS and then runs the GenSynData code via "hadoop jar". The default number of mappers is set to 100.
GenSynData requires 8 parameters:
	GenSynData <tempPath> <inPath> <outPath> <hdfsPath> <datasetType> <N> <D> <S>
	<tempPath> : temporary directory in HDFS (e.g., "input/temp")
	<inPath>   : empty input directory in HDFS for the internal files used to generate the synthetic data (e.g., "input/ds")
	<outPath>  : output directory in HDFS for the Hadoop output files (e.g., "input/dataset")
	<hdfsPath> : absolute address of your home directory in HDFS (e.g., "hdfs://172.0.0.1:9000/user/fedis/")
	<datasetType> : 1, 2, or 3 (1 scales both data and features; 2 and 3 scale data or features, respectively) (e.g., 1)
	<N> : number of data instances (e.g., 480000)
	<D> : number of feature dimensions (e.g., 18000)
	<S> : sparsity of the dataset (e.g., 0.01)
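Assuming <S> denotes the fraction of nonzero entries (an interpretation of "sparsity", not stated explicitly above), the expected number of nonzeros is roughly N * D * S. For the example parameters:

```shell
# Expected nonzeros for N=480000, D=18000, S=0.01 (assuming S = nonzero fraction).
awk 'BEGIN { printf "%.0f\n", 480000 * 18000 * 0.01 }'
# prints 86400000
```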

If you run the following command,
	hadoop jar DPSCD-0.0.1-SNAPSHOT-r.jar kr.ac.kaist.itcknow.bigml.algo.hadoop.GenSynData input/temp input/ds input/dataset hdfs://172.20.1.1:9000/user/dykang/ 1 480000 18000 0.01
then you can find the following files in your <outPath> directory: 
	input/dataset/1/whole_D_18000_N_480000_S_0.01_NS_-4_DS_-4.txt
	input/dataset/1/whole_D_18000_N_480000_S_0.01_NS_-2_DS_-2.txt
	input/dataset/1/whole_D_18000_N_480000_S_0.01_NS_0_DS_0.txt
	input/dataset/1/whole_D_18000_N_480000_S_0.01_NS_2_DS_2.txt
	input/dataset/1/whole_D_18000_N_480000_S_0.01_NS_4_DS_4.txt
You can also generate other types of synthetic datasets by changing <datasetType> to 2 or 3.
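To verify the generated files and pull one down for local inspection, the standard `hadoop fs` commands can be used (a sketch with the example paths above):

```shell
# List the generated synthetic datasets in HDFS.
hadoop fs -ls input/dataset/1/
# Copy one dataset to the local filesystem and peek at its first lines.
hadoop fs -get input/dataset/1/whole_D_18000_N_480000_S_0.01_NS_0_DS_0.txt .
head -n 3 whole_D_18000_N_480000_S_0.01_NS_0_DS_0.txt
```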

