Code + Data

Stage 1: Building the cue lexicon

In stage 1, the cue lexicon is built by running SAGE on ~120 labeled books and magazines.

Steps

  1. Obtain ark-sage-0.1.jar from here. ark-sage is a Java library I have written to make it easier to perform SAGE inference on text. We will be using the SupervisedSAGE module of the library to perform inference on the ideological book corpus.

  2. Preparing the data. SupervisedSAGE takes as input data of the format

    label1 label2 ... labelN<TAB>token1:count1 token2:count2 ... tokenN:countN
    label1 label2 ... labelN<TAB>token1:count1 token2:count2 ... tokenN:countN
    ...
    In our case, tokens are bigrams and trigrams (words joined with _) that have undergone typical NLP text processing (stemming, punctuation and stopword removal, vocabulary reduction, etc.).

    Due to the nature of our data, we are not able to provide the actual text of the books/magazines. Instead, you can find the following labeled input to SupervisedSAGE (with a reduced vocabulary of 28,731 bigrams/trigrams) here:

    1. Ideological corpus with ideology labeling (7.5MB)
    2. Ideological corpus with ideology and topic labeling (7.5MB)

    Furthermore, we have normalized term counts at the sentence level, i.e. each token in a sentence of length N contributes a count of 1/N.
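
    For concreteness, here is a minimal Python sketch of how one might assemble a single input line in this format from a labeled, tokenized document, including the 1/N sentence-level normalization described above. It is illustrative only; the function name, labels, and tokens below are made up and are not part of ark-sage.

    from collections import defaultdict

    def make_input_line(labels, sentences):
        """Build one SupervisedSAGE input line for a labeled document.

        labels    -- list of label strings (e.g. ideology and/or topic labels)
        sentences -- list of sentences, each a list of n-gram tokens
                     (bigrams/trigrams joined with "_")
        """
        counts = defaultdict(float)
        for sentence in sentences:
            if not sentence:
                continue
            weight = 1.0 / len(sentence)  # each token in a sentence of length N gets count 1/N
            for token in sentence:
                counts[token] += weight
        token_field = " ".join("%s:%g" % (tok, cnt) for tok, cnt in sorted(counts.items()))
        return "%s\t%s" % (" ".join(labels), token_field)

    # Hypothetical example:
    print(make_input_line(["RIGHT", "economy"],
                          [["repeal_obamacar", "taxpay_fund"],
                           ["croni_capit", "tax_code", "american_famili"]]))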

  3. Extracting the cue lexicon can be accomplished by running

    supervised-SAGE.sh --input-counts no_uni+stem+no_stopw+topics.word_counts --output no_uni+stem+no_stopw+topics-30.sage --iterations 20 --config-file l1_weights
    where the file l1_weights contains the regularization weight (λ in the paper) for each label (effect) in the form --l1-weight <label>=<value>.

    The weights files we used for the paper (λ=30): without topics and with topics.
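
    For illustration, a config file specifying λ=30 for two hypothetical labels LEFT and RIGHT would contain entries of the form

    --l1-weight LEFT=30
    --l1-weight RIGHT=30

    with one such entry for each label (effect) appearing in the input file; the actual label names are whatever labels you used in the input data.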

  4. The output cue lexicon used in our paper, as a SAGE file and as a list of terms. You can also explore the terms here.


Stage 2: Cue-Lag Ideological Proportions Model

In stage 2, we use the cue lexicon built in stage 1 to perform inference on candidate speeches and obtain their ideological proportions.

Steps

  1. Preparing the candidate speeches. The original collection of candidate speeches is first pre-processed (tokenized, stemmed, normalized, etc.). [stemmed speeches]

    Using the stemmed speeches and cue lexicon as input, we build the cue-lag representation using these (hacky) Python scripts: create-model-terms.py and create-lag-data.py.

    python create-model-terms.py ${SAGE_FILE} > ${MODEL_FOLDER}/terms.sage
    python create-lag-data.py ${TERMS_FILE} ${SPEECH_FOLDER} ${MODEL_FOLDER}
    where ${SAGE_FILE} and ${TERMS_FILE} are the stage 1 output as a SAGE file and as a list of terms, respectively. ${SPEECH_FOLDER} is the folder containing the stemmed speeches and ${MODEL_FOLDER} is the folder to which the cue-lag representation is written.

    create-model-terms.py creates a tabular file where each row is a cue term and each column corresponds to the term's weight under one ideology. create-lag-data.py converts tokenized .txt files (tokens separated by spaces, one sentence per line) into ".lag" files, which look like

    __START_OF_SPEECH__	0
    william_penn	17
    found_father	4
    bear_wit	13
    unit_state	71
    state_capitol	12
    ben_franklin	29
    unit_state	30
    presid_obama	27
    social_medicin	3
    entitl_spend	36
    social_engin	6
    taxpay_fund	54
    deepli_troubl	2
    american_citizen	16
    polit_power	51
    repeal_obamacar	47
    republican_senat	14
    turn_point	12
    futur_gener	26
    unit_state	67
    state_senat	0
    barack_obama	17
    obama_polici	0
    live_free	31
    croni_capit	71
    govern_regul	7
    tax_code	3
    american_energi	11
    energi_product	0
    american_famili	3
    tradit_marriag	13
    religi_liberti	9
    american_peopl	10
    presid_obama	143
    liber_polici	3
    repeal_obamacar	8
    american_peopl	21
    repeal_obamacar	98
    american_peopl	59
    full_measur	25
    straw_poll	49
    south_carolina	86
    god_plan	72
    presidenti_campaign	54
    god_bless	4
    unit_state	6
    __END_OF_SPEECH__	3
    __SPEECH_LENGTH__	1457

    You can get the candidate speeches in cue-lag format here.
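
    If you want to inspect the cue-lag representation programmatically, the following minimal Python sketch reads a .lag file back into (term, lag) pairs; the filename is a placeholder, and the parsing simply follows the format shown above.

    def read_lag_file(path):
        """Read a .lag file into a list of (term, lag) pairs and the speech length.

        Each line is a tab-separated pair: a cue term (or one of the
        __START_OF_SPEECH__ / __END_OF_SPEECH__ markers) and an integer lag;
        the final __SPEECH_LENGTH__ line records the total number of tokens.
        """
        cues = []
        speech_length = None
        with open(path) as f:
            for line in f:
                term, value = line.rstrip("\n").split("\t")
                if term == "__SPEECH_LENGTH__":
                    speech_length = int(value)
                else:
                    cues.append((term, int(value)))
        return cues, speech_length

    cues, length = read_lag_file("speech.lag")  # hypothetical path
    print(len(cues), "cue terms,", length, "tokens in total")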

  2. Running CLIP. The Java sampler code for CLIP, along with some configuration files, is available here. To run it on a candidate (do edit run-model.sh to point to the correct directories; they are currently hardcoded to my working directory),

    ./run-model.sh --config-file candidates.settings --output-dir sampler_output --data-dir model-data/bachmann-no_uni+stem+no_stopw+topics-30 --terms-file model-data/bachmann-no_uni+stem+no_stopw+topics-30/terms.sage

    In the sampler code directory, one can find several .settings files, which specify parameters for the Gibbs sampler such as the number of iterations, initial hyperparameters, etc. There are also separate .settings files for the other baselines used in the paper.

  3. After the sampler finishes, the individual samples can be found in sampler_output/samples as individual .gz files, each containing tab-separated lines for every speech with the (ideology, restart) values sampled for every term.

    It should be straightforward to write scripts in your favorite language to process these samples and perform the needed analysis. I am not posting the analysis scripts (including those for creating this website) here, as they are too "hardcoded" and hacky to be useful outside of my working directory.
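
    That said, here is a rough Python sketch of the kind of post-processing I mean: it estimates a candidate's ideological proportions by counting how often each ideology was sampled across all posterior samples. The per-line layout it assumes (one speech per line, each tab-separated field holding an "ideology,restart" pair for one cue term) and the paths are guesses that you will need to adapt to the actual sample files.

    import glob
    import gzip
    from collections import Counter

    def ideology_proportions(sample_dir):
        """Estimate ideological proportions from a directory of posterior samples.

        Assumes each *.gz sample file has one line per speech, with tab-separated
        fields, each field an "ideology,restart" pair for one cue term; adjust
        the parsing below to match the real sample format.
        """
        counts = Counter()
        for path in glob.glob(sample_dir + "/*.gz"):
            with gzip.open(path, "rt") as f:
                for line in f:
                    for field in line.rstrip("\n").split("\t"):
                        if not field:
                            continue
                        ideology, _restart = field.split(",")
                        counts[ideology] += 1
        total = float(sum(counts.values()))
        return {ideology: n / total for ideology, n in counts.items()}

    print(ideology_proportions("sampler_output/samples"))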

    If you would like the posterior samples, or have any other data requests, feel free to contact me here.