Code + Data

Stage 1: Building the cue lexicon

In stage 1, the cue lexicon is built by running SAGE on ~120 labeled books and magazines.

Steps

  1. Obtain ark-sage-0.1.jar from here. ark-sage is a Java library I have written to make it easier to perform SAGE inference on text. We will be using the SupervisedSAGE module of the library to perform inference on the ideological book corpus.

  2. Preparing the data. SupervisedSAGE takes as input data of the format

    label1 label2 ... labelN<TAB>token1:count1 token2:count2 ... tokenN:countN
    label1 label2 ... labelN<TAB>token1:count1 token2:count2 ... tokenN:countN
    ...
    In our case, tokens are bigrams and trigrams (words joined with _) that have undergone typical NLP text processing (stemming, punctuation and stopword removal, vocabulary reduction, etc.).

    Due to the nature of our data, we are not able to provide the actual text of the books/magazines. Instead, you can find the following labeled input to SupervisedSAGE (with a reduced vocabulary of 28,731 bigrams/trigrams) here:

    1. Ideological corpus with ideology labeling (7.5MB)
    2. Ideological corpus with ideology and topic labeling (7.5MB)

    Furthermore, we have normalized term counts at the sentence level, i.e. each token in a sentence of length N contributes a count of 1/N.
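
    For concreteness, here is a minimal Python sketch of how one might assemble a single input line in this format from a labeled, tokenized document, including the 1/N sentence-level normalization described above. It is illustrative only; the function name, labels, and tokens below are made up and are not part of ark-sage.

    from collections import defaultdict

    def make_input_line(labels, sentences):
        """Build one SupervisedSAGE input line for a labeled document.

        labels    -- list of label strings (e.g. ideology and/or topic labels)
        sentences -- list of sentences, each a list of n-gram tokens
                     (bigrams/trigrams joined with "_")
        """
        counts = defaultdict(float)
        for sentence in sentences:
            if not sentence:
                continue
            weight = 1.0 / len(sentence)  # each token in a sentence of length N gets count 1/N
            for token in sentence:
                counts[token] += weight
        token_field = " ".join("%s:%g" % (tok, cnt) for tok, cnt in sorted(counts.items()))
        return "%s\t%s" % (" ".join(labels), token_field)

    # Hypothetical example:
    print(make_input_line(["RIGHT", "economy"],
                          [["repeal_obamacar", "taxpay_fund"],
                           ["croni_capit", "tax_code", "american_famili"]]))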

  3. Extracting the cue lexicon can be accomplished by running

    supervised-SAGE.sh --input-counts no_uni+stem+no_stopw+topics.word_counts --output no_uni+stem+no_stopw+topics-30.sage --iterations 20 --config-file l1_weights
    where the file l1_weights contains the regularization weight (λ in the paper) for each label (effect) in the form --l1-weight <label>=<value>.

    The weights files we used for the paper (λ=30): without topics and with topics.
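
    For illustration, a config file specifying λ=30 for two hypothetical labels LEFT and RIGHT would contain entries of the form

    --l1-weight LEFT=30
    --l1-weight RIGHT=30

    with one such entry for each label (effect) appearing in the input file; the actual label names are whatever labels you used in the input data.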

  4. The output cue lexicon used in our paper, as a SAGE file and as a list of terms. You can also explore the terms here.


Stage 2: Cue-Lag Ideological Proportions Model

In stage 2, we use the cue lexicon built in stage 1 to perform inference on candidate speeches and obtain their ideological proportions.

Steps

  1. Preparing the candidate speeches. The original collection of candidate speeches is first pre-processed (tokenized, stemmed, normalized, etc.). [stemmed speeches]

    Using the stemmed speeches and cue lexicon as input, we build the cue-lag representation using these (hacky) Python scripts: create-model-terms.py and create-lag-data.py.

    python create-model-terms.py ${SAGE_FILE} > ${MODEL_FOLDER}/terms.sage
    python create-lag-data.py ${TERMS_FILE} ${SPEECH_FOLDER} ${MODEL_FOLDER}
    where ${SAGE_FILE} and ${TERMS_FILE} are the stage 1 output as a SAGE file and as a list of terms, respectively. ${SPEECH_FOLDER} is the folder containing the stemmed speeches and ${MODEL_FOLDER} is the folder to which the cue-lag representation is written.

    create-model-terms.py creates a tabular file where each row is a cue term and each column corresponds to the term's weight under one ideology. create-lag-data.py converts tokenized .txt files (tokens separated by spaces, one sentence per line) into ".lag" files, which look like

    __START_OF_SPEECH__	0
    william_penn	17
    found_father	4
    bear_wit	13
    unit_state	71
    state_capitol	12
    ben_franklin	29
    unit_state	30
    presid_obama	27
    social_medicin	3
    entitl_spend	36
    social_engin	6
    taxpay_fund	54
    deepli_troubl	2
    american_citizen	16
    polit_power	51
    repeal_obamacar	47
    republican_senat	14
    turn_point	12
    futur_gener	26
    unit_state	67
    state_senat	0
    barack_obama	17
    obama_polici	0
    live_free	31
    croni_capit	71
    govern_regul	7
    tax_code	3
    american_energi	11
    energi_product	0
    american_famili	3
    tradit_marriag	13
    religi_liberti	9
    american_peopl	10
    presid_obama	143
    liber_polici	3
    repeal_obamacar	8
    american_peopl	21
    repeal_obamacar	98
    american_peopl	59
    full_measur	25
    straw_poll	49
    south_carolina	86
    god_plan	72
    presidenti_campaign	54
    god_bless	4
    unit_state	6
    __END_OF_SPEECH__	3
    __SPEECH_LENGTH__	1457

    You can get the candidate speeches in cue-lag format here.
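
    If you want to inspect the cue-lag representation programmatically, the following minimal Python sketch reads a .lag file back into (term, lag) pairs; the filename is a placeholder, and the parsing simply follows the format shown above.

    def read_lag_file(path):
        """Read a .lag file into a list of (term, lag) pairs and the speech length.

        Each line is a tab-separated pair: a cue term (or one of the
        __START_OF_SPEECH__ / __END_OF_SPEECH__ markers) and an integer lag;
        the final __SPEECH_LENGTH__ line records the total number of tokens.
        """
        cues = []
        speech_length = None
        with open(path) as f:
            for line in f:
                term, value = line.rstrip("\n").split("\t")
                if term == "__SPEECH_LENGTH__":
                    speech_length = int(value)
                else:
                    cues.append((term, int(value)))
        return cues, speech_length

    cues, length = read_lag_file("speech.lag")  # hypothetical path
    print(len(cues), "cue terms,", length, "tokens in total")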

  2. Running CLIP. The Java sampler code for CLIP, along with some configuration files, is available here. To run it on a candidate (do edit run-model.sh to point to the correct directories; they are currently hardcoded to my working directory),

    ./run-model.sh --config-file candidates.settings --output-dir sampler_output --data-dir model-data/bachmann-no_uni+stem+no_stopw+topics-30 --terms-file model-data/bachmann-no_uni+stem+no_stopw+topics-30/terms.sage

    In the sampler code directory, one can find several .settings files, which specify parameters for the Gibbs sampler such as the number of iterations, initial hyperparameters, etc. There are also separate .settings files for the other baselines used in the paper.

  3. After the sampler finishes, the individual samples can be found in sampler_output/samples as individual .gz files, each containing tab-separated lines for every speech with the (ideology, restart) values sampled for every term.

    It should be straightforward to write scripts in your favorite language to process these samples and perform the needed analysis. I am not posting the analysis scripts (including those for creating this website) here, as they are too "hardcoded" and hacky to be useful outside of my working directory.
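
    That said, here is a rough Python sketch of the kind of post-processing I mean: it estimates a candidate's ideological proportions by counting how often each ideology was sampled across all posterior samples. The per-line layout it assumes (one speech per line, each tab-separated field holding an "ideology,restart" pair for one cue term) and the paths are guesses that you will need to adapt to the actual sample files.

    import glob
    import gzip
    from collections import Counter

    def ideology_proportions(sample_dir):
        """Estimate ideological proportions from a directory of posterior samples.

        Assumes each *.gz sample file has one line per speech, with tab-separated
        fields, each field an "ideology,restart" pair for one cue term; adjust
        the parsing below to match the real sample format.
        """
        counts = Counter()
        for path in glob.glob(sample_dir + "/*.gz"):
            with gzip.open(path, "rt") as f:
                for line in f:
                    for field in line.rstrip("\n").split("\t"):
                        if not field:
                            continue
                        ideology, _restart = field.split(",")
                        counts[ideology] += 1
        total = float(sum(counts.values()))
        return {ideology: n / total for ideology, n in counts.items()}

    print(ideology_proportions("sampler_output/samples"))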

    If you would like the posterior samples, or have any other data requests, feel free to contact me here.