
Server Deployment

Joe edited this page Jul 6, 2020 · 8 revisions

Curating Datasets

Before starting the curation process, ensure that you've normalized as much data as possible. See the Normalize data page for details.

Important: Be sure to read about the performance aspects before you begin.

The curation process takes your normalized datasets, removes duplicates, computes indexes, and sorts the indexes so that they can be searched efficiently. You'll use the leakdb-curator program to do all of this. Let's say you have a single file, normalized.json, containing all of your normalized data:

Basic Example
$ ./leakdb-curator --json normalized.json

This will produce a leakdb/ directory containing the results of the process. It should look something like:

drwx------ 2 moloch moloch 4.0K Jul  3 07:45 ./
drwxrwxr-x 7 moloch moloch 4.0K Jul  3 11:58 ../
-rw------- 1 moloch moloch  55G Jul  3 02:53 bloomed.json
-rw-rw-r-- 1 moloch moloch 6.2G Jul  3 08:15 email.idx
-rw-rw-r-- 1 moloch moloch 6.2G Jul  3 07:45 user.idx
  • bloomed.json - The results of the bloom filter (de-duplication) process.
  • email.idx - The email index of the bloomed.json file.
  • user.idx - The user index of the bloomed.json file.
  • domain.idx - The domain index of the bloomed.json file. Note that this file is not generated by default, since indexing domains takes a very long time due to the large number of collisions inherent in this type of data (e.g., lots of people use Gmail).
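
To make the de-duplication step more concrete, here is a minimal Python sketch of Bloom-filter based duplicate removal. It is not leakdb-curator's actual implementation: the `BloomFilter` class, the `dedupe` helper, the filter size, and the one-record-per-line input are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not LeakDB's."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: bytes) -> bool:
        """Insert item; return True if it was (probably) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def dedupe(lines):
    """Yield each line that the filter has not (probably) seen already."""
    bloom = BloomFilter()
    for line in lines:
        if not bloom.add(line.encode("utf-8")):
            yield line
```

A Bloom filter uses a fixed amount of memory regardless of input size, which is why it suits very large files. The trade-off is a small false-positive rate: occasionally a unique line may be treated as a duplicate and dropped.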

You can run local queries using the leakdb-curator program too:

$ ./leakdb-curator search --json bloomed.json --index email.idx --value [email protected]
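
The reason a sorted index makes such a lookup fast can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LeakDB's on-disk index format: the `(digest, offset)` entry layout, the 8-byte truncated SHA-256 key, the one-JSON-object-per-line file, and the helper names `key_digest`, `build_index`, and `search` are all hypothetical.

```python
import bisect
import hashlib
import json


def key_digest(value: str) -> bytes:
    # Fixed-width key so index entries compare and sort cheaply (assumed layout).
    return hashlib.sha256(value.encode("utf-8")).digest()[:8]


def build_index(path, field):
    """Return a sorted list of (key digest, byte offset) for one field."""
    entries = []
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            record = json.loads(line)
            if field in record:
                entries.append((key_digest(record[field]), offset))
            offset += len(line)
    entries.sort()
    return entries


def search(path, index, field, value):
    """Binary-search the index, then seek directly to each matching record."""
    digest = key_digest(value)
    pos = bisect.bisect_left(index, (digest, 0))
    matches = []
    with open(path, "rb") as f:
        while pos < len(index) and index[pos][0] == digest:
            f.seek(index[pos][1])
            record = json.loads(f.readline())
            if record.get(field) == value:  # guard against digest collisions
                matches.append(record)
            pos += 1
    return matches
```

Because the entries are sorted, finding the first candidate takes a logarithmic number of comparisons, and only the matching records are read from the large data file.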

LeakDB API Server

You can provide remote access to the datasets using the leakdb-server:

$ ./leakdb-server --json bloomed.json --index-user ./user.idx --index-email ./email.idx

See leakdb-server --help for more options. The server can be queried via the JSON API or with the leakdb command line client.
