# README.TXT
/*
Phalanx - Deidentify
MIT License
*/
Phalanx
- In Safe Harbor Document Deidentification Mode
Thanks for downloading.
The enclosed software is already built and its resources configured. After running CONFIG.SH, you should be able to run the program.
To build this program you need: g++ and Boost libraries newer than version 1.66.
To recalculate the supporting resources (dictionaries, rules, etc.) you need: Python (standard modules os, sys, re, glob, itertools), Perl, and shasum.
Some tools for evaluation and debugging require Python distutils and Cython.
Phalanx is a general-purpose, high-performance NLP platform for processing all corpora in a folder. Its defining features include: trivial string-operation overhead through compiled resources, a minimal token footprint, a fast yet extensive rule engine, and control features that allow processing to be adjusted from 1 token at a time to a full batch.
It achieves fast document processing through extensive initialization and high memory use. It leaks memory at about 250 MB/hour.
To run Phalanx - Deidentify
Put all the files that you want to process into a directory and link 'Corpora' to that directory
There must be a directory linked to 'Corpora' and that directory must have at least 1 file.
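The linking step above can be sketched in Python (the source directory and file name here are hypothetical placeholders; Phalanx itself only requires that 'Corpora' resolve to a directory containing at least 1 file):

```python
import os
import tempfile

# Hypothetical source directory holding the documents to process;
# a throwaway directory stands in for it here.
source_dir = tempfile.mkdtemp()
with open(os.path.join(source_dir, "note1.txt"), "w") as f:
    f.write("sample clinical note")

# Phalanx expects a link named 'Corpora' in its working directory.
workdir = tempfile.mkdtemp()
link = os.path.join(workdir, "Corpora")
os.symlink(source_dir, link)

# Sanity check: the linked directory must contain at least 1 file.
assert any(os.path.isfile(os.path.join(link, f)) for f in os.listdir(link))
```

On a real install this is simply `ln -s /path/to/files Corpora` run from the Phalanx directory.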
Run the program Phalanx - Deidentify: ./Controller Deidentify_Controller_Node
To use Phalanx - Deidentify as an API
Put all the files that you want to process into a directory and link 'Corpora' to that directory
There must be a directory linked to 'Corpora' and that directory must have at least 1 file.
C++
void process_all_documents();
void process_next_document();
void reinit_nodes();
void mark_end_of_processing();
void sort_files_in_queue_by_size();
long add_file_to_queue(string text_file);
long add_directory_to_queue(string directory_file);
Create a Controller object and then process documents. Example:
Controller *reader1 = new Controller(argv[1],main_log,&main_log_mark,main_log_max);
reader1->process_all_documents();
OR
Controller *reader1 = new Controller(argv[1],main_log,&main_log_mark,main_log_max);
bool next_file_available = true;
while(next_file_available) {
    reader1->process_next_document();
    next_file_available = reader1->get_next_file();
    if(next_file_available) {
        reader1->reinit_nodes();
    } else {
        reader1->mark_end_of_processing();
    }
}
Python
This is enabled through Cython. Compiled and linked files are included. The Python API mirrors the C++ API:
safe_harbor_deid()
process_all_documents()
reinit_nodes()
mark_end_of_processing()
sort_files_in_queue_by_size()
add_file_to_queue(string)
add_directory_to_queue(string)
import safe_harbor_deid
deid = safe_harbor_deid.safe_harbor_deid()
deid.process_all_documents()
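As a rough illustration of the queue helpers' semantics, here is a plain-Python sketch. These are stand-in functions, not the Cython binding, and sorting smallest-first is an assumption about what sort_files_in_queue_by_size does:

```python
import os
import tempfile

def add_directory_to_queue(queue, directory):
    """Append every regular file in a directory to the queue (sketch)."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            queue.append(path)
    return len(queue)

def sort_files_in_queue_by_size(queue):
    """Order the queue by file size, smallest first (assumed order)."""
    queue.sort(key=os.path.getsize)

# Demo with throwaway files of different sizes.
d = tempfile.mkdtemp()
for name, size in [("big.txt", 3000), ("small.txt", 10), ("mid.txt", 300)]:
    with open(os.path.join(d, name), "w") as f:
        f.write("x" * size)

queue = []
add_directory_to_queue(queue, d)
sort_files_in_queue_by_size(queue)
print([os.path.basename(p) for p in queue])  # smallest to largest
```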
Safe Harbor Deidentification Mode of Phalanx is an abridged pipeline of NLP annotators culminating in NER annotators that write out the text offsets of detected PII. It uses the HIPAA Safe Harbor deidentification method.
To run the MIMIC2 corpus through Phalanx - Deidentify:
Download MIMIC2
Copy the corpus file into your Corpora folder
Run Phalanx - Deidentify
Transform the output into MIMIC2 format: perl output_mimic2.pl ../Corpora/id.text ../Workspace/PII_Output1.csv > out_mimic.csv
Run your own evaluation or use the one included with MIMIC2 deid
HOW DOES THIS COMPARE TO OTHER DEIDENTIFICATION PROGRAMS?
deid is by MIT and bundled with MIMIC2.
scrubber is by NIH.
MIMIC2

| Program | Specificity | Sensitivity | Time | Memory | CPU Load |
|---|---|---|---|---|---|
| Phalanx - Deidentify | .81 | .97 | 19s | 850MB | 1 |
| deid | .75 | .97 | 340s | 90MB | 1 |
| scrubber | ~.25 | ~.99 | 120s | 2.1GB | 2 |

i2b2-2014 Deidentification (combined into 1 file)

| Program | Specificity | Sensitivity | Time | Memory | CPU Load |
|---|---|---|---|---|---|
| Phalanx - Deidentify | .90 | .80 | 25s | 750MB | 1 |
| deid | .87 | .61 | 647s | 90MB | 1 |
| scrubber | - | - | 120s | 2.1GB | 2 |

i2b2-2014 Deidentification (514 files)

| Program | Specificity | Sensitivity | Time | Memory | CPU Load |
|---|---|---|---|---|---|
| Phalanx - Deidentify | - | - | 25s | 250MB | 1 |
| deid | - | - | 845s | 90MB | 1 |
| scrubber | - | - | 90s | 1GB | 2 |