Server Deployment
Before starting the curation process, ensure that you've normalized as much data as possible. See the Normalize data page for details.
Important: Be sure to read about the performance aspects before you begin.
The curation process takes your normalized data sets, removes duplicates, computes indexes, and sorts those indexes so they can be searched efficiently. You'll use the leakdb-curator program for all of this. Say you have a single file, normalized.json, containing all of your normalized data:
$ ./leakdb-curator --json normalized.json
This will produce a leakdb/ directory with the results of the process; it should look something like:
drwx------ 2 moloch moloch 4.0K Jul 3 07:45 ./
drwxrwxr-x 7 moloch moloch 4.0K Jul 3 11:58 ../
-rw------- 1 moloch moloch 55G Jul 3 02:53 bloomed.json
-rw-rw-r-- 1 moloch moloch 6.2G Jul 3 08:15 email.idx
-rw-rw-r-- 1 moloch moloch 6.2G Jul 3 07:45 user.idx
- bloomed.json - The results of the bloom filter (de-duplication) process.
- email.idx - The email index of the bloomed.json file.
- user.idx - The user index of the bloomed.json file.
- domain.idx - The domain index of the bloomed.json file. Note that by default this file is not generated, since it takes a very long time to index domains due to the large number of collisions inherent in this type of data (e.g., lots of people use Gmail).
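The pipeline described above (de-duplicate, index, sort) can be sketched in a few lines. This is an in-memory illustration, not leakdb's actual implementation: a hash set stands in for the bloom filter, and the index maps key values to record positions rather than file offsets.

```python
import hashlib
import json

def dedupe(records):
    """Drop duplicate records. A hash set stands in for the bloom filter
    the real curator uses (a bloom filter trades exactness for memory)."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

def build_index(records, key):
    """Collect (value, position) pairs for one field, sorted by value so
    lookups can binary-search the index instead of scanning every record."""
    return sorted((rec[key], pos) for pos, rec in enumerate(records) if key in rec)

creds = [
    {"email": "a@example.com", "user": "a", "password": "x"},
    {"email": "b@example.com", "user": "b", "password": "y"},
    {"email": "a@example.com", "user": "a", "password": "x"},  # duplicate
]
unique = dedupe(creds)               # 2 records survive
email_idx = build_index(unique, "email")
```

The real curator does this on disk at much larger scale, which is why the sort and index steps dominate the runtime.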
You can run local queries using the leakdb-curator program too:
$ ./leakdb-curator search --json bloomed.json --index email.idx --value [email protected]
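Because the .idx files are sorted, a query like the one above can resolve with a binary search rather than a full scan. A minimal sketch of that lookup over an in-memory analogue of an index (the on-disk format will differ):

```python
import bisect

def lookup(index, value):
    """Binary-search a sorted list of (value, position) pairs and return
    every position whose value matches (a value may repeat)."""
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return [pos for _, pos in index[lo:hi]]

# Hypothetical index entries for illustration:
index = [("a@example.com", 0), ("a@example.com", 7), ("b@example.com", 3)]
hits = lookup(index, "a@example.com")
```

Each returned position would then be used to read the matching record out of bloomed.json.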
You can provide remote access to the datasets using the leakdb-server:
$ ./leakdb-server --json bloomed.json --index-user ./user.idx --index-email ./email.idx
See leakdb-server --help for more options; the server can be queried via the JSON API or with the leakdb command-line client.
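If you want to query the server from your own tooling rather than the leakdb client, it boils down to an HTTP request carrying a JSON body. The /api/search path and the body shape below are assumptions for illustration only; check the project's JSON API documentation for the real endpoint and schema.

```python
import json
from urllib import request

def search_request(base_url, email):
    """Build a query against a running leakdb-server. NOTE: the
    /api/search path and {"email": ...} body are hypothetical --
    consult the leakdb JSON API docs for the actual shape."""
    body = json.dumps({"email": email}).encode()
    return request.Request(
        base_url + "/api/search",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Sending it requires a running server, e.g.:
#   with request.urlopen(search_request("http://localhost:8080", "user@example.com")) as resp:
#       results = json.load(resp)
```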