RegExum

RegExum is a Python wrapper that simplifies text search over terabyte- and petabyte-scale textual datasets stored in one of the following DBMSs:

MongoDB - a modern (yet mature) distributed document DB with good performance in most workloads,
ElasticSearch - the go-to DBMS-complementary indexing software for texts and categorical data,
PostgreSQL - the most feature-rich open-source relational DB,
MySQL - the most commonly used relational DB.

Project Structure

regexum - Python wrappers for searchable containers backed by persistent DBs.
benchmarks - benchmarking tools and performance results.
assets - tiny datasets for testing purposes.

Implementation Details & Included DBs

Some common databases have licences that prohibit sharing of benchmark results, so they were excluded from comparisons.

Name	Purpose	Implementation Language	Lines of Code (in `/src/`)
MongoDB	Documents	C++	3'900'000
PostgreSQL	Tables	C	1'300'000
ElasticSearch	Text	Java	730'000
Unum	Graphs, Tables, Text	C++	80'000

ElasticSearch

A Java-based document store built on top of the Lucene text index.
Widely considered a high-performance solution, largely due to the lack of competition.
Lucene has been ported to multiple languages, in projects such as CLucene and LucenePlusPlus.
A very popular open-source project backed by a publicly traded company ($ESTC).

MongoDB

A distributed ACID document store.
Internally uses the BSON binary format.
A very popular open-source project backed by a publicly traded company ($MDB).
Provides bindings for most programming languages (including PyMongo for Python).
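
As an illustration of regex search through the Python bindings, here is a minimal sketch built on MongoDB's `$regex` query operator and PyMongo's `Collection.find`. It is not necessarily how RegExum issues its queries; the helper names `regex_filter` and `find_matching`, as well as the database, collection and field names in the usage comment, are assumptions made for this example.

```python
def regex_filter(field, pattern, ignore_case=False):
    """Build a MongoDB filter document that matches `field` against `pattern`."""
    condition = {"$regex": pattern}
    if ignore_case:
        # "i" is MongoDB's case-insensitive regex option.
        condition["$options"] = "i"
    return {field: condition}


def find_matching(collection, field, pattern, ignore_case=False):
    """Run the filter on a collection object (expected: a PyMongo Collection)."""
    return collection.find(regex_filter(field, pattern, ignore_case))


# Usage against a running server (hypothetical names, requires `pymongo`):
#   from pymongo import MongoClient
#   texts = MongoClient("mongodb://localhost:27017")["demo"]["texts"]
#   for doc in find_matching(texts, "text", r"\d+"):
#       print(doc["_id"])
```

Keeping the filter construction separate from the call to `find` makes the query easy to inspect and test without a live database.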

PostgreSQL, MySQL and other SQLs

The most common open-source SQL databases.
They work well in a single-node environment, but scale poorly out of the box.
They mostly store search indexes in the form of a B-Tree, which generally provides good read performance but is slow to update.
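
To make the read/update trade-off more concrete, here is a minimal sketch that uses a sorted Python list as a simplified stand-in for an ordered index. It is only an analogy, not how PostgreSQL or MySQL implement their B-Trees; the `OrderedIndex` class and its method names are hypothetical.

```python
import bisect


class OrderedIndex:
    """Sorted list of (key, row_id) pairs, standing in for an ordered index."""

    def __init__(self):
        self._entries = []

    def insert(self, key, row_id):
        # An update must find the right slot and shift every later entry,
        # which is the part that becomes expensive as the index grows.
        bisect.insort(self._entries, (key, row_id))

    def lookup(self, key):
        # A read is a binary search followed by a short scan over equal keys.
        position = bisect.bisect_left(self._entries, (key,))
        row_ids = []
        while position < len(self._entries) and self._entries[position][0] == key:
            row_ids.append(self._entries[position][1])
            position += 1
        return row_ids


index = OrderedIndex()
for row_id, word in enumerate(["pear", "apple", "fig", "apple"]):
    index.insert(word, row_id)

print(index.lookup("apple"))  # [1, 3]
print(index.lookup("kiwi"))   # []
```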

TODO

A new re.Pattern-like object for queries and a more list-like interface for DBs:
- Finding the first match via .index(re.Pattern).
- Streaming all matches via .indexes(re.Pattern).
- Classical methods .append(iterable) and .extend(iterable) for index extension.
Mixed multithreaded read/write benchmarks.
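
The list-like interface proposed above could look roughly like the sketch below. It is an in-memory stand-in, not the planned implementation: the class name `InMemoryTexts` is hypothetical, texts live in a plain Python list instead of a database, and `.append` is assumed here to take a single text rather than an iterable.

```python
import re
from typing import Iterable, Iterator, List


class InMemoryTexts:
    """Hypothetical list-like text container searchable with compiled patterns."""

    def __init__(self) -> None:
        self._texts: List[str] = []

    def append(self, text: str) -> None:
        # Assumption: one text per call, mirroring list.append.
        self._texts.append(text)

    def extend(self, texts: Iterable[str]) -> None:
        self._texts.extend(texts)

    def indexes(self, pattern: "re.Pattern") -> Iterator[int]:
        # Lazily stream the position of every text containing a match.
        for position, text in enumerate(self._texts):
            if pattern.search(text):
                yield position

    def index(self, pattern: "re.Pattern") -> int:
        # Like list.index: return the first match or raise ValueError.
        for position in self.indexes(pattern):
            return position
        raise ValueError(f"no text matches {pattern.pattern!r}")


texts = InMemoryTexts()
texts.extend(["alpha beta", "gamma 42", "delta 7"])
texts.append("epsilon")

digits = re.compile(r"\d+")
print(texts.index(digits))          # 1
print(list(texts.indexes(digits)))  # [1, 2]
```

Making `.indexes` a generator keeps the streaming behaviour described in the TODO: a DB-backed version could yield matches as a cursor advances instead of materializing them all.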
