This data is used in our AAAI 2016 paper, "Distant IE by Bootstrapping Using Lists and Document Structure" (Bing et al. 2016). The release has four parts: input corpora, Freebase seeds, labeled evaluation data, and BioASQ queries. I have not yet had time to tidy the data from the intermediate steps of our framework, and will try to release it later. I suggest that users read our paper first to become familiar with the terminology used below. Feel free to contact the authors about anything that is unclear, and please cite our paper if you use this data in your work.

1 CORPORA (in corpora/)

We have four corpora here: the target drug corpus DailyMed, the target disease corpus WikiDisease, the structured drug corpus WebMD, and the structured disease corpus MayoClinic.

1.1 How the data is collected

- DailyMed is downloaded from dailymed.nlm.nih.gov and contains 28,590 XML documents, each of which describes a drug.
- WikiDisease is extracted from a Wikipedia dump of May 2015 and contains 8,596 disease articles. To extract these articles from the dump, we first collect disease names from a few sources, such as the ICD category, the disease list page in Wikipedia, DBpedia category information, and the isA relation in Freebase. We then use this list to extract the corresponding articles from the dump. Note that WikiDisease contains some articles that are not about diseases, due to noise in the disease name list and the simple extraction strategy.
- WebMD is collected from www.webmd.com and contains 2,096 pages. Each drug page has the same 7 sections, such as Uses, Side Effects, and Precautions.
- MayoClinic is collected from www.mayoclinic.org and contains 1,117 pages. The sections of MayoClinic pages include Symptoms, Causes, Risk Factors, Treatments and Drugs, and Prevention, among others.

1.2 Data files (in corpora/CORPUS_NAME)

In each folder, named after its corpus, you will find the following files:

- all_para_sent_code.txt: the unique sentences, each with an ID.
- all_para_ss_code.xml: all documents of the corpus, with each sentence replaced by its sentence ID from all_para_sent_code.txt. The root node is , and each doc is a subnode . The subnodes under are self-explanatory. Note the pattern "## WARNING: Respiratory ...": here "WARNING: Respiratory ..." is a section title, which always follows "## ", and the node name "warning" is generated from the section title by rules. To recover the original document contents, replace each sentence ID with its content from all_para_sent_code.txt.

2 SEED (in seeds/)

The seeds are extracted from a snapshot of Freebase downloaded in July 2015. In this folder, you will find four files:

- disease_seeds.txt: all 18,082 triples of five relations in Freebase: treatments, symptoms, risk factors, causes, and prevention factors.
- disease_seeds_used_single.txt: the seeds that exist in the WikiDisease corpus and belong to only one relation. Note that only some of the seeds can be matched with our corpus (the subject matches the document subject, and the object appears in the document content). In the paper, we only use the seeds that belong to a single relation.
- drug_seeds.txt: all 7,803 triples of three relations: used to treat, conditions this may prevent, and side effects.
- drug_seeds_used_single.txt: the seeds that exist in DailyMed.

Each line in these files follows the format:

    seed RELATION SUBJECT@OBJECT

We use SecondString (Cohen, Ravikumar, and Fienberg 2003) to match the seeds against the corpora. After matching, we revise the string of a matched seed as follows (just a trick for convenience later).
For example, if sideEffects(morphine_sulfate,headaches) in drug_seeds.txt exists in DailyMed, but the object "headaches" matches the string "headache" in the "Morphine Sulfate" document, we revise the seed to sideEffects(morphine_sulfate,headache) in the file drug_seeds_used_single.txt.

3 EVALUATION DATA (in eval_data/)

We manually labeled 10 pages from the WikiDisease corpus and 10 pages from the DailyMed corpus. The annotated text fragments are the NPs that are object values of the 8 relations (five disease relations and three drug relations), with the drug or disease described by the corresponding document as the relation subject. In total, we collected 436 triples for the disease domain and 320 triples for the drug domain.

3.1 Labeled triples (in eval_data/triples)

In this folder, you will find two files: disease_anno_trec_eval and drug_anno_trec_eval. The labeled triples are formatted for evaluation with the trec_eval package. Each line follows the format:

    SUBJECT_RELATION Q0 OBJECT 1

One example is "abscess_causes Q0 bacterial_infection 1", where "abscess" is the disease name, "causes" is the relation name, and "bacterial_infection" is the object value.

3.2 Labeled pages (in eval_data/labeled_pages)

We also share the labeled pages. This folder has two subfolders, disease and drug, each of which has the same files as a subfolder of "corpora/". In addition, we provide the original document content in the "all.xml" files.

4 BIOASQ DATA (in bioASQ/)

- refined_eight_bioasq3b.examples: the 58 queries used in our paper.
- rules_eight.ppr: the rules used to answer those queries with ProPPR.

REFERENCE

Bing, L.; Ling, M.; Wang, R.; and Cohen, W. 2016. Distant IE by bootstrapping using lists and document structure. In AAAI 2016.

Cohen, W. W.; Ravikumar, P.; and Fienberg, S. E. 2003. A comparison of string distance metrics for name-matching tasks. In IIWeb-03.
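
The two line formats described in sections 2 and 3.1 can be read with a few lines of Python. The following is only a minimal sketch, not part of the released tooling. It assumes that fields are separated by whitespace (spaces or tabs) and that subjects and objects contain no whitespace, as in the underscore-joined examples above. The function names parse_seed_line, parse_qrels_line, and group_qrels are hypothetical, and the sample seed line in the usage note is constructed from the example in section 2 rather than copied from the files.

```python
from collections import defaultdict


def parse_seed_line(line):
    """Parse 'seed RELATION SUBJECT@OBJECT' into (relation, subject, object).

    Assumption: whitespace-separated fields; the first '@' separates
    the subject from the object.
    """
    parts = line.strip().split(None, 2)
    if len(parts) != 3 or parts[0] != "seed" or "@" not in parts[2]:
        raise ValueError(f"unexpected seed line: {line!r}")
    subject, _, obj = parts[2].partition("@")
    return parts[1], subject, obj


def parse_qrels_line(line):
    """Parse 'SUBJECT_RELATION Q0 OBJECT 1' into (query_id, object, relevance).

    The SUBJECT_RELATION field is kept whole as the query ID, because
    splitting it safely would require knowing the relation names.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"unexpected labeled-triple line: {line!r}")
    query_id, _, obj, relevance = fields
    return query_id, obj, int(relevance)


def group_qrels(lines):
    """Map each query ID to the set of its labeled object values."""
    gold = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        query_id, obj, relevance = parse_qrels_line(line)
        if relevance > 0:
            gold[query_id].add(obj)
    return dict(gold)
```

For instance, parse_seed_line("seed sideEffects morphine_sulfate@headache") yields the relation, subject, and object as three strings, and group_qrels can be applied directly to an open file handle for disease_anno_trec_eval or drug_anno_trec_eval. If the files turn out to use a different separator, only the split calls need to change.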