Tool for anonymizing emails
Welcome. This page hosts a free Java-based tool for anonymizing emails. The text below explains how to use the tool.
Assumptions:
- The emails are short (we have only used emails smaller than 12 KB) and are in plain text form. A MimeMessage email can be stripped of some headers and attachments to create such a text form. In our research, we have only kept the "From", "To", "Cc", "Subject" and "Date" headers and the body text from the original MimeMessages. Here is an example email.
- A MySQL database is running. A table called "temporary", which stores anonymization tags, must be created in it before the anonymization tool is run. Here is an SQL command file that shows the schema.
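The reduction of an email to the five kept headers plus the body text, as described in the first assumption above, can be sketched as follows. This is only an illustration that works on raw message text with the Java standard library; it is not the tool's own code, the class name EmailStripper is hypothetical, and it assumes a single-part message without attachments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the tool: keeps only the From, To, Cc,
// Subject and Date headers of a raw message, followed by the body text.
class EmailStripper {
    private static final Set<String> KEPT = new HashSet<String>(
            Arrays.asList("from", "to", "cc", "subject", "date"));

    static String strip(String raw) {
        // Normalize line endings so the header/body split is uniform.
        String text = raw.replace("\r\n", "\n");
        // The first blank line separates the headers from the body.
        int split = text.indexOf("\n\n");
        String headerPart = split < 0 ? text : text.substring(0, split);
        String body = split < 0 ? "" : text.substring(split + 2);

        StringBuilder out = new StringBuilder();
        boolean keeping = false;
        for (String line : headerPart.split("\n")) {
            // A line starting with whitespace continues the previous header.
            boolean continuation = line.startsWith(" ") || line.startsWith("\t");
            if (!continuation) {
                int colon = line.indexOf(':');
                String name = colon < 0 ? "" : line.substring(0, colon).trim();
                keeping = KEPT.contains(name.toLowerCase(Locale.ENGLISH));
            }
            if (keeping) {
                out.append(line).append('\n');
            }
        }
        return out.append('\n').append(body).toString();
    }

    public static void main(String[] args) {
        String raw = "Received: by mail.example.org\n"
                + "From: alice@example.org\n"
                + "X-Mailer: demo\n"
                + "Subject: Meeting\n"
                + " on Friday\n"
                + "\n"
                + "See you at 10.\n";
        // Prints the From and Subject headers (including the folded
        // continuation line), a blank line, and the body.
        System.out.print(strip(raw));
    }
}
```

Multi-part messages would need a real MIME parser such as JavaMail, which is among the libraries listed in the software requirements.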
Software Requirements:
- Java Runtime Environment >= 1.5
- A set of libraries (versions included in the download zip package): JavaMail, a JDBC driver for MySQL, and Log4j.
Anonymization Protocol:
- This tool can be used to anonymize a related set of emails by multiple people simultaneously.
- Let us assume that there is a set of entity types that need to be anonymized for a domain (types like FULL_NAME, LOCATION, EMAIL, PHONE_NUMBER, FIRST_NAME and so on). Different entities often relate to the same object or person in a domain. For example, the strings "Dipanjan", "Dipanjan Das" and "dipanjan AT cs DOT cmu DOT edu" all relate to me. In our anonymization protocol, we decided to keep the relation between these entities intact, so that mapping the entities back to a set of fictitious persons or objects at a later step creates a realistic scenario.
- At any stage, the anonymization process proceeds as follows:
- There exists a finalized tagset that all anonymizers agree on. This finalized tagset maps actual token sequences to entity tags of the form ENTITY_NNNN+EMMM.
- In ENTITY_NNNN+EMMM, ENTITY refers to a domain entity (e.g. FULL_NAME), NNNN is the global index of this entity. In other words, for the first entity in the finalized tagset NNNN=0001, for the 100th entity, NNNN=0100. Note that we can have only 9999 unique entity tags for a domain. However, this constraint can be easily changed by modifying the source code.
- In ENTITY_NNNN+EMMM, MMM refers to the MMMth unique object in the domain. For example, if I am the first unique object in the domain, all entity tags that refer to me in the tagset will have MMM=001. We use MMM=000 for entities that are not related to any other entity tag. Again, note that we can have only 999 unique objects in the domain, and this limit can be changed by modifying the code.
- The finalized tagset of the above form is frozen during one iteration of anonymization and is used by all anonymizers working on the dataset.
- There exists another, temporary tagset that is shared between the anonymizers. This mapping between tags and real token sequences is stored in a database table that all anonymizers have access to. The tags have the form ENTITY_NNNN+T000, where ENTITY and NNNN have the same meaning as above. For example, if there are 100 such temporary tags in the database, NNNN will range from 0001 to 0100.
- This temporary tagset is maintained so that different anonymizers do not assign different tags to the same token sequence.
- The anonymization tool uses the finalized tagset and the tagset from the database to mask token sequences that are present in the unified mapping. The user can unmask unwanted maskings very easily using the UI. This will be further explained in the tool description.
- At the end of an iteration of anonymization, all anonymizers or one leading anonymizer extract the temporary tags from the database, append them to the global tagset and rename them semi-automatically to get a larger finalized tagset. The temporary database is flushed, and the emails containing the temporary tags are processed to replace the temporary tags with final tags. The process then starts again from Step 1.
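The tag conventions and the masking step described above can be illustrated with a short sketch. It is not taken from the tool's source: the class TagSketch and its methods are hypothetical names, the zero-padding follows the NNNN and MMM description above, and the masking strategy (longest token sequence first, single pass) is an assumption about one reasonable way to apply a unified mapping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the tag format; names are illustrative only.
class TagSketch {
    // Finalized tag: ENTITY_NNNN+EMMM
    static String finalTag(String entity, int index, int object) {
        return String.format(Locale.ENGLISH, "%s_%04d+E%03d", entity, index, object);
    }

    // Temporary tag: ENTITY_NNNN+T000
    static String tempTag(String entity, int index) {
        return String.format(Locale.ENGLISH, "%s_%04d+T000", entity, index);
    }

    // Masks every token sequence found in the unified mapping (finalized
    // plus temporary). Keys are assumed to be non-empty strings.
    static String mask(String text, Map<String, String> mapping) {
        if (mapping.isEmpty()) {
            return text;
        }
        // Try longer sequences first so "Dipanjan Das" wins over "Dipanjan".
        List<String> keys = new ArrayList<String>(mapping.keySet());
        Collections.sort(keys, new Comparator<String>() {
            public int compare(String a, String b) {
                return b.length() - a.length();
            }
        });
        StringBuilder alternation = new StringBuilder();
        for (String key : keys) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(key));
        }
        // A single pass means inserted tags are never matched again.
        Matcher m = Pattern.compile(alternation.toString()).matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(mapping.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<String, String>();
        mapping.put("Dipanjan", finalTag("FIRST_NAME", 1, 1));
        mapping.put("Dipanjan Das", finalTag("FULL_NAME", 2, 1));
        mapping.put("Pittsburgh", tempTag("LOCATION", 1));
        System.out.println(mask("Dipanjan Das wrote to Dipanjan from Pittsburgh.", mapping));
    }
}
```

In this example both name tags end in E001 because they refer to the same object, while the location carries a temporary tag until the next finalized tagset is produced.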
Tool Description:
- The tool can be downloaded from here.
- There are three configuration files in the zip that need to be modified:
- db.properties: Describes the database config. Here is an example file.
- hard.disk.emails.properties: Describes the configuration of various input parameters to the tool. Here is an example file. Please click here for further details about this file.
- log4j.properties: Describes the log4j configuration.
- The zip contains the example files "newmapping.txt" and "outmap.txt", which correspond to the properties "email.org.map" and "email.input.map" respectively in the properties file "hard.disk.emails.properties". You can also see them here and here. Note:
- The perl script newmapping_to_outmap.pl can be used to convert newmapping.txt to outmap.txt.
- newmapping.txt is hand-editable and uses "\f\n" as the separator between token-entity pairs. To refer to the same object in the domain, one can just type in the entity followed by a plus sign and a code for the object (e.g. FULL_NAME+das can refer to "dipanjan das"). The perl script translates this to the actual finalized tag (say, FULL_NAME_0002+E001) in outmap.txt.
- The following set of operations constitutes one iteration of anonymization:
- Run run_anonymization_ui.bat: this is the program that runs the anonymization UI. The mappings present in outmap.txt are used as the permanent mappings. If you create new replacements, they will go into the database, which should initially be empty. You can continue doing this until you have enough tags in the database. Click here to see how to use the interface.
- Run run_get_temp_map_from_db.bat: this should be run when you want to take the temporary tags out of the database and append them to the permanent tagset. The program reads the temporary database entries and creates the file defined by the property "email.temp.map". It also appends the new entity types and the actual strings to the newmapping.txt file. Please note that running this twice for the same database config will append duplicate entries to the newmapping.txt file. This should be avoided.
After this, you should hand-edit the newmapping.txt file by adding +unique_object_code at the end of entity tags, if any. After that, run the newmapping_to_outmap.pl script to get a new outmap.txt file containing the new permanent tagset for the next iteration.
- Run create_temp_to_new_map.bat: this program takes the new entries in the newmapping.txt file and creates a map between the temporary replacement tags in "email.temp.map" and the new entries in "outmap.txt". You need to edit the argument in the bat file so that it denotes the first new tag number. For example, if the last entry of outmap.txt was PROJECT_NAME_461+E000 before running newmapping_to_outmap.pl, then the argument should be 462. All the new entries in outmap.txt from _462+E to the end will then have a mapping to the temporary tags in "email.temp.map". This should be done carefully. The map produced is the file corresponding to "email.temp.to.new.map".
- Run retag_emails.bat: This script just uses "email.temp.to.new.map" to map the temporary tags to the new tags in the emails that contain the temporary tags.
- Run flush_temporary_database.bat: DANGER!!! This script flushes the temporary database. You should run this script when you don't need the temporary tags anymore.
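The idea behind the create_temp_to_new_map and retag_emails steps above can be sketched in a few lines. This is a hedged illustration and not the tool's implementation: the class RetagSketch is a hypothetical name, the maps are built in memory rather than read from the property files, and pairing temporary and final tags through their shared token sequence is an assumption (the real script instead takes the first new tag number as its argument).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; it does not read the tool's actual map files.
class RetagSketch {
    // Matches temporary tags of the form ENTITY_NNNN+T000.
    private static final Pattern TEMP_TAG = Pattern.compile("[A-Z_]+_\\d+\\+T000");

    // Assumption: a temporary tag and its final tag are paired through the
    // token sequence they both stand for.
    static Map<String, String> tempToNew(Map<String, String> tokenToTemp,
                                         Map<String, String> tokenToFinal) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : tokenToTemp.entrySet()) {
            String finalTag = tokenToFinal.get(e.getKey());
            if (finalTag == null) {
                throw new IllegalStateException(
                        "No final tag for temporary tag " + e.getValue());
            }
            result.put(e.getValue(), finalTag);
        }
        return result;
    }

    // Replaces every known temporary tag in an already anonymized email.
    static String retag(String email, Map<String, String> tempToNew) {
        Matcher m = TEMP_TAG.matcher(email);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = tempToNew.get(m.group());
            // Temporary tags without a final counterpart are left untouched.
            String value = replacement != null ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tokenToTemp = new LinkedHashMap<String, String>();
        tokenToTemp.put("Radar", "PROJECT_NAME_0001+T000");
        Map<String, String> tokenToFinal = new LinkedHashMap<String, String>();
        tokenToFinal.put("Radar", "PROJECT_NAME_0462+E000");

        Map<String, String> map = tempToNew(tokenToTemp, tokenToFinal);
        System.out.println(retag("Status of PROJECT_NAME_0001+T000 is good.", map));
    }
}
```

Failing loudly when a temporary tag has no final counterpart mirrors the advice above that this step should be done carefully, since a wrong or missing pairing would leave inconsistent tags in the emails.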
Licensing: University of Illinois/NCSA Open Source License
Send email to dipa...@cs.cmu.edu with any questions about running the tool.