Discovering Patterns and Relationships among Investors, Startups and Beyond

In the past decade, information technology has spawned a great number of companies whose products make our lives more entertaining and convenient. It is therefore worthwhile to analyze data from the startup arena for interesting patterns: which companies are more likely to go public or get acquired, which domains yield more successful products or companies in a particular time frame, who tends to invest in which companies, and so on. We have done some research along those lines recently, and we would like to share our preliminary thoughts in this document.

TechCrunch released a public CrunchBase corpus in a downloadable form on June 6, 2013, and you may want to explore their data set as well. However, this article is still useful in that it gives you a sense of our ideas and what we have done.

Corpus Download


A large corpus of high-quality data is essential to any data mining task, and to that end we used the public data set provided by CrunchBase in our research. It is a great data source with the following benefits:

  1. Accessibility. This is a public data set with a free CrunchBase API to facilitate crawling.
  2. Diversity. The CrunchBase corpus contains structured data for a variety of categories including "companies", "people", "financial organizations", "service providers", "funding rounds" and "acquisitions", as well as unstructured data such as the TechCrunch news articles for certain companies, people, products, etc.
  3. Volume. This is among the largest corpora in the tech world. As of December 11, 2012, CrunchBase has profiles for 106,802 companies, 140,712 people, 8,557 financial organizations, 5,007 service providers, 32,044 funding rounds and 7,385 acquisitions. These numbers will keep growing due to the open nature of the CrunchBase corpus.
  4. Openness. CrunchBase is essentially a wiki-like corpus, which relies on the web community to edit most of its pages (though profiles for large companies like Facebook and Google are not editable by the general public).


Corpus Statistics and Links

To better illustrate our research results and to get you started in this area quickly (if you are interested, of course), we share within Carnegie Mellon the corpus that we collected from CrunchBase and TechCrunch in mid-2012. If you do not have a CMU IP address and yet want to play with the data for research purposes only, send me an email.


  1. Company profiles
  2. Person profiles
  3. Financial organization profiles
  4. Product profiles
  5. Service provider profiles
  6. News articles (source: TechCrunch), covering 8,689 companies
  7. News article URLs (source: TechCrunch), covering 8,605 companies

Data Format and Scripts

CrunchBase assigns a unique ID called a permalink to each entity. For example, the permalink for Accel Partners is accel-partners. In the CrunchBase corpus we provide above, all file names are permalinks. In the package with news article URLs, each file is named by a permalink and holds one article per line: a date followed by the URL of the corresponding TechCrunch article. The data set with TechCrunch articles is structured somewhat differently. Each company has a folder named by its permalink. Inside each folder there are one or more files, each representing a TechCrunch article about that company and named by the MD5 hash of the article URL.
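To make the layout above concrete, here is a minimal Python sketch of how one might read a URL file and locate the matching article file. It rests on several assumptions that the description does not settle: that the date and the URL on each line are separated by whitespace, that the MD5 hash is the hexadecimal digest of the UTF-8 encoded URL, and that article files carry no extension. The function names and the `articles_root` parameter are ours, for illustration only.

```python
import hashlib
from pathlib import Path


def parse_url_file(path):
    """Yield (date, url) pairs from one permalink-named URL file.

    Assumption: each non-empty line is "<date><whitespace><url>".
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            date, url = line.split(None, 1)
            yield date, url.strip()


def article_path(articles_root, permalink, url):
    """Return the expected location of the article file for a URL.

    Assumption: the file name is the hex MD5 digest of the UTF-8
    encoded URL, with no file extension.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return Path(articles_root) / permalink / digest
```

If the resulting paths do not exist on disk, the hashing assumption (encoding, trailing slashes, extension) is the first thing to check against a few real files.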

The URLs and TechCrunch articles are stored as plain text in our corpus, while the JSON entity profiles from CrunchBase were saved with the dump() method of Python's pickle module. To recover the JSON object from a pickled entity profile file, read it back with the matching load() function of the same module.
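A minimal sketch of that loading step is given below. The function name `load_profile` is ours. The fallback to `encoding="latin1"` is an assumption: it only matters if the files were pickled under Python 2 and are now read under Python 3, where byte strings with non-ASCII content otherwise fail to decode.

```python
import pickle


def load_profile(path):
    """Load one pickled CrunchBase entity profile.

    Only use this on files you trust: unpickling can execute
    arbitrary code.
    """
    with open(path, "rb") as f:
        try:
            return pickle.load(f)
        except UnicodeDecodeError:
            # Assumed case: a Python 2 pickle read under Python 3.
            f.seek(0)
            return pickle.load(f, encoding="latin1")
```

For example, `profile = load_profile("accel-partners")` followed by `sorted(profile.keys())` lists the available fields, assuming the stored JSON object was a dictionary.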


Acquisition Prediction

In this work, we examined the task of Merger and Acquisition (M&A) prediction, which has been an interesting and challenging research topic for the past few decades. Specifically, we used the profiles and news articles for companies and people on TechCrunch, and explored topic features via topic modeling techniques, as well as a set of other novel features of our design within a machine learning framework. We conducted experiments of the largest scale in the literature, and achieved a high true positive rate (TP) between 60% and 79.8% with a false positive rate (FP) mostly between 0% and 8.3% over company categories with a small number of missing attributes in the CrunchBase profiles. Please refer to our paper [short][long] for more details.
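For readers less familiar with the two metrics quoted above, the sketch below shows how a true positive rate and a false positive rate are conventionally computed from binary labels. It is not our evaluation code; the function name and the convention that 1 means "acquired" and 0 means "not acquired" are assumptions made for illustration.

```python
def tp_fp_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate).

    Assumed convention: label 1 = acquired, label 0 = not acquired.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against categories with no positives or no negatives.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

The true positive rate is the fraction of actually acquired companies that the classifier flags, and the false positive rate is the fraction of companies that were not acquired but were flagged anyway.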

More Ideas

The work introduced above is just our first step, and there are many more interesting ideas that could be explored on the CrunchBase data set. If you have some thoughts and would like to share them with us, we would be more than happy to hear from you.