1. (Q) Why use the SWIKI-2011 dataset's results? Why not SWIKI-2012 (or LWIKI-2012)?
   (A) SWIKI/LWIKI 2012 allowed participants to perform their own preprocessing of the text. We did not want this to influence the results.

2. (Q) Why does this instance-specific cut-off work?
   (A) I don't know; it simply seems to work (and work really well).

3. (Q) The number of classes reported in the table is not the same as in the original data. Why?
   (A) We did some pruning (a sketch of these preprocessing steps is given after the FAQ):
   - If a node 'A' had a single child 'B', we collapsed B into A.
   - If an internal node in the hierarchy had positive examples, we created a new leaf under it and re-assigned all of its instances to that leaf node.
   - For all graph-based dependencies, if two adjacent nodes both had training examples, we created an empty dummy node between them. This prevents the two classes from being regularized directly towards each other; instead, each is regularized towards their mean (the parameters of the dummy node).

4. (Q) How many "outer" iterations did you run the algorithm for?
   (A) Around 10-12; beyond that there was no improvement in training error.

5. (Q) Are the timing comparisons very accurate?
   (A) No, the timings are not 100% accurate, because other jobs were also running on the cluster.

6. (Q) How do you handle large training datasets?
   (A) Unfortunately, I do not yet have a satisfying answer for this. Currently, the training data must fit in main memory and must be communicated to all cluster nodes before the algorithm starts. You can, however, try the following:
   a. Store the data more wisely. For text, instead of storing floating-point weights, store the discrete term counts and apply the "ltc" normalization (log term frequency, idf, cosine length normalization) on the fly; a sketch is given after the FAQ.
   b. Compress the data and see if it fits.
   c. If the data still does not fit, try stochastic gradient descent to solve each inner iteration.
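The following is a minimal sketch of the hierarchy pruning described in item 3. It assumes the hierarchy is stored as a dict mapping each node id to its list of children, that `labels` maps each node id to the set of instances assigned directly to it, and that graph dependencies are given as a list of edges; these names and data structures are illustrative and do not come from the actual code.

    def collapse_single_children(children, labels):
        """Collapse any node B that is the only child of its parent A into A
        (A inherits B's instances and B's children)."""
        changed = True
        while changed:
            changed = False
            for parent in list(children.keys()):
                kids = children.get(parent)
                if kids is not None and len(kids) == 1:
                    only = kids[0]
                    labels.setdefault(parent, set()).update(labels.pop(only, set()))
                    children[parent] = children.pop(only, [])
                    changed = True
        return children, labels

    def add_leaves_under_internal_positives(children, labels, next_id):
        """Give every internal node that has positive examples a fresh leaf child
        and move all of its instances to that leaf."""
        for node in list(children.keys()):
            if children.get(node) and labels.get(node):
                leaf, next_id = next_id, next_id + 1
                children[node].append(leaf)
                children[leaf] = []
                labels[leaf] = labels.pop(node)
        return children, labels, next_id

    def insert_dummy_nodes(edges, labels, next_id):
        """For graph-based dependencies: if both endpoints of an edge have
        training examples, place an empty dummy node between them so that the
        two classes are regularized towards the dummy's parameters (their mean)
        rather than directly towards each other."""
        new_edges = []
        for u, v in edges:
            if labels.get(u) and labels.get(v):
                dummy, next_id = next_id, next_id + 1
                labels[dummy] = set()
                new_edges.extend([(u, dummy), (dummy, v)])
            else:
                new_edges.append((u, v))
        return new_edges, labels, next_id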
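And here is a minimal sketch of suggestion (a) in item 6: keep the corpus as raw integer term counts plus a single idf vector, and materialize the ltc-weighted rows only for the documents needed in the current inner iteration. The names `ltc_rows`, `counts`, and `idf` are again illustrative, not taken from the implementation.

    import numpy as np
    from scipy import sparse

    def ltc_rows(counts, idf, row_ids):
        """Return ltc-weighted feature vectors for the requested documents.

        counts  : scipy.sparse.csr_matrix of raw integer term counts (docs x terms)
        idf     : 1-d array of precomputed idf values, e.g. log(N / df) per term
        row_ids : indices of the documents needed in the current inner iteration
        """
        block = counts[row_ids].astype(np.float64)        # slice only what is needed
        block.data = 1.0 + np.log(block.data)             # 'l': log term frequency
        block.data *= idf[block.indices]                  # 't': idf weighting
        norms = np.sqrt(np.asarray(block.multiply(block).sum(axis=1)).ravel())
        norms[norms == 0.0] = 1.0                         # avoid division by zero
        return sparse.diags(1.0 / norms).dot(block)       # 'c': cosine normalization

With this arrangement each cluster node only keeps the integer counts and the idf vector in memory; the floating-point ltc representation is built just in time for the rows touched by an inner iteration and can be discarded afterwards.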