1. (Q) Why use the SWIKI-2011 dataset's results? Why not SWIKI-2012 (or LWIKI-2012)?
   (A) SWIKI/LWIKI 2012 allowed participants to perform their own preprocessing of the text. We did not want this to influence the results.

2. (Q) Why does this instance-specific cut-off work?
   (A) I don't know; it simply seems to work (and work really well).

3. (Q) The number of classes reported in the table is not the same as in the original data. Why?
   (A) We did some pruning (a sketch of these preprocessing steps is given after the FAQ):
   - If a node 'A' had a single child 'B', we collapsed B into A.
   - If an internal node in the hierarchy had positive examples, we created a new leaf under it and re-assigned all of its instances to that leaf node.
   - For all graph-based dependencies, if two adjacent nodes both had training examples, we created an empty dummy node between them. This prevents the two classes from being regularized directly towards each other; instead, each is regularized towards their mean (the parameters of the dummy node).

4. (Q) How many "outer" iterations did you run the algorithm for?
   (A) Around 10-12; beyond that there was no improvement in training error.

5. (Q) Are the timing comparisons very accurate?
   (A) No, the timings are not 100% accurate, because other jobs were also running on the cluster.

6. (Q) How do you handle large training datasets?
   (A) Unfortunately, I do not yet have a satisfying answer for this. Currently, the training data must fit in main memory and must be communicated to all cluster nodes before the algorithm starts. You can, however, try the following:
   a. Store the data more wisely. For text, instead of storing floating-point weights, store the discrete term counts and apply the "ltc" normalization (log term frequency, idf, cosine length normalization) on the fly; a sketch is given after the FAQ.
   b. Compress the data and see if it fits.
   c. If the data still does not fit, try stochastic gradient descent to solve each inner iteration.
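The following is a minimal sketch of the hierarchy pruning described in item 3. It assumes the hierarchy is stored as a dict mapping each node id to its list of children, that `labels` maps each node id to the set of instances assigned directly to it, and that graph dependencies are given as a list of edges; these names and data structures are illustrative and do not come from the actual code.

    def collapse_single_children(children, labels):
        """Collapse any node B that is the only child of its parent A into A
        (A inherits B's instances and B's children)."""
        changed = True
        while changed:
            changed = False
            for parent in list(children.keys()):
                kids = children.get(parent)
                if kids is not None and len(kids) == 1:
                    only = kids[0]
                    labels.setdefault(parent, set()).update(labels.pop(only, set()))
                    children[parent] = children.pop(only, [])
                    changed = True
        return children, labels

    def add_leaves_under_internal_positives(children, labels, next_id):
        """Give every internal node that has positive examples a fresh leaf child
        and move all of its instances to that leaf."""
        for node in list(children.keys()):
            if children.get(node) and labels.get(node):
                leaf, next_id = next_id, next_id + 1
                children[node].append(leaf)
                children[leaf] = []
                labels[leaf] = labels.pop(node)
        return children, labels, next_id

    def insert_dummy_nodes(edges, labels, next_id):
        """For graph-based dependencies: if both endpoints of an edge have
        training examples, place an empty dummy node between them so that the
        two classes are regularized towards the dummy's parameters (their mean)
        rather than directly towards each other."""
        new_edges = []
        for u, v in edges:
            if labels.get(u) and labels.get(v):
                dummy, next_id = next_id, next_id + 1
                labels[dummy] = set()
                new_edges.extend([(u, dummy), (dummy, v)])
            else:
                new_edges.append((u, v))
        return new_edges, labels, next_id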
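And here is a minimal sketch of suggestion (a) in item 6: keep the corpus as raw integer term counts plus a single idf vector, and materialize the ltc-weighted rows only for the documents needed in the current inner iteration. The names `ltc_rows`, `counts`, and `idf` are again illustrative, not taken from the implementation.

    import numpy as np
    from scipy import sparse

    def ltc_rows(counts, idf, row_ids):
        """Return ltc-weighted feature vectors for the requested documents.

        counts  : scipy.sparse.csr_matrix of raw integer term counts (docs x terms)
        idf     : 1-d array of precomputed idf values, e.g. log(N / df) per term
        row_ids : indices of the documents needed in the current inner iteration
        """
        block = counts[row_ids].astype(np.float64)        # slice only what is needed
        block.data = 1.0 + np.log(block.data)             # 'l': log term frequency
        block.data *= idf[block.indices]                  # 't': idf weighting
        norms = np.sqrt(np.asarray(block.multiply(block).sum(axis=1)).ravel())
        norms[norms == 0.0] = 1.0                         # avoid division by zero
        return sparse.diags(1.0 / norms).dot(block)       # 'c': cosine normalization

With this arrangement each cluster node only keeps the integer counts and the idf vector in memory; the floating-point ltc representation is built just in time for the rows touched by an inner iteration and can be discarded afterwards.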