This dataset was used in the blog post here.

It's derived from the SEC 10-K dataset of Kogan, Levin, Routledge, Sagi, and Smith (2009): www.ark.cs.cmu.edu/10K/