This repository was archived by the owner on Dec 19, 2023. It is now read-only.

unum-cloud/RegExum


RegExum

RegExum is a Python wrapper that simplifies text search over terabyte- and petabyte-scale textual datasets stored in one of the following DBMSs:

  • MongoDB - a modern (yet mature) distributed document DB with good performance in most workloads,
  • ElasticSearch - the go-to indexing software for texts and categorical data, typically run alongside a primary DBMS,
  • PostgreSQL - the most feature-rich open-source relational DB,
  • MySQL - the most commonly used relational DB.

Project Structure

  • regexum - Python wrappers for searchable containers backed by persistent DBs.
  • benchmarks - benchmarking tools and performance results.
  • assets - tiny datasets for testing purposes.

Implementation Details & Included DBs

Some common databases have licences that prohibit sharing of benchmark results, so they were excluded from comparisons.

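Name          | Purpose              | Implementation Language | Lines of Code (in /src/)
--------------|----------------------|-------------------------|-------------------------
MongoDB       | Documents            | C++                     | 3'900'000
PostgreSQL    | Tables               | C                       | 1'300'000
ElasticSearch | Text                 | Java                    | 730'000
Unum          | Graphs, Tables, Text | C++                     | 80'000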

ElasticSearch

  • Java-based document store built on top of Lucene text index.
  • Widely considered a high-performance solution, largely due to the lack of competition.
  • Lucene has been ported to other languages in projects such as CLucene and LucenePlusPlus.
  • Very popular open-source project backed by a publicly traded company ($ESTC).

MongoDB

  • A distributed ACID document store.
  • Internally uses the BSON binary format.
  • Very popular open-source project backed by a publicly traded company ($MDB).
  • Provides bindings for most programming languages (including PyMongo for Python).
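
As a rough illustration of regex search through PyMongo, the sketch below builds a filter document with MongoDB's `$regex` and `$options` query operators and passes it to `find`. The helper names (`regex_filter`, `find_matches`) and all arguments (connection URI, database, collection, field) are hypothetical and are not taken from RegExum itself.

```python
from typing import Any, Dict


def regex_filter(field: str, pattern: str, ignore_case: bool = False) -> Dict[str, Any]:
    """Build a MongoDB filter document that matches `field` against a regex."""
    condition: Dict[str, Any] = {"$regex": pattern}
    if ignore_case:
        condition["$options"] = "i"  # case-insensitive matching
    return {field: condition}


def find_matches(uri: str, database: str, collection: str, field: str, pattern: str):
    """Return a cursor over matching documents (needs a reachable MongoDB server)."""
    # Imported lazily so that regex_filter stays usable without PyMongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    return client[database][collection].find(regex_filter(field, pattern))
```

Only `regex_filter` runs without a server; `find_matches` assumes a running MongoDB instance and an installed `pymongo` package.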

PostgreSQL, MySQL and other SQL databases

  • Most common open-source SQL databases.
  • Work well in a single-node environment, but scale poorly out of the box.
  • Mostly store search indexes in the form of a B-tree. They generally provide good read performance, but are slow to update.
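
To make the B-tree point concrete, here is a minimal sketch that uses SQLite from Python's standard library as a stand-in (an assumption: SQLite is not one of the databases covered here, but it also stores indexes as B-trees). It compares the query plan of an exact lookup, which can descend the index, with a substring search, which has to scan every entry. The table and index names are made up for the example.

```python
import sqlite3


def query_plan(connection, sql, parameters=()):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, parameters).fetchall()
    return [row[-1] for row in rows]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT)")
connection.execute("CREATE INDEX idx_docs_content ON docs (content)")
connection.executemany(
    "INSERT INTO docs (content) VALUES (?)",
    [("alpha",), ("beta",), ("gamma",)],
)

# Equality lookup: the planner can search the B-tree index.
exact_plan = query_plan(connection, "SELECT id FROM docs WHERE content = ?", ("beta",))

# Unanchored substring search: the ordering of the B-tree does not help,
# so the planner falls back to a full scan.
substring_plan = query_plan(connection, "SELECT id FROM docs WHERE content LIKE ?", ("%et%",))

print(exact_plan)
print(substring_plan)
```

The exact wording of the plans differs between SQLite versions, but the first is reported as a SEARCH that uses the index and the second as a SCAN.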

TODO

  • A new re.Pattern-like object for queries and a more list-like interface for DBs:
    • Finding the first match via .index(re.Pattern).
    • Streaming all matches via .indexes(re.Pattern).
    • Classical methods .append(iterable) and .extend(iterable) for index extension.
  • Mixed multithreaded read/write benchmarks.
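
The list-like interface proposed above might look roughly like the in-memory sketch below. This is not RegExum's actual API: the class name `InMemoryTexts` is hypothetical, the storage is a plain Python list rather than a database, and `append` is assumed to take a single text (following Python's `list` convention) while `extend` takes an iterable.

```python
from __future__ import annotations

import re
from typing import Iterable, Iterator, List


class InMemoryTexts:
    """Hypothetical list-like text container; a real backend would query a DB."""

    def __init__(self) -> None:
        self._texts: List[str] = []

    def append(self, text: str) -> None:
        """Add a single text to the container."""
        self._texts.append(text)

    def extend(self, texts: Iterable[str]) -> None:
        """Add every text from an iterable."""
        for text in texts:
            self.append(text)

    def indexes(self, pattern: re.Pattern[str]) -> Iterator[int]:
        """Lazily yield the position of every text the pattern matches."""
        for position, text in enumerate(self._texts):
            if pattern.search(text):
                yield position

    def index(self, pattern: re.Pattern[str]) -> int:
        """Return the position of the first match, like list.index."""
        for position in self.indexes(pattern):
            return position
        raise ValueError(f"no text matches {pattern.pattern!r}")


texts = InMemoryTexts()
texts.extend(["alpha beta", "gamma 42", "delta 7"])
digits = re.compile(r"\d+")
first_match = texts.index(digits)
all_matches = list(texts.indexes(digits))
```

Raising `ValueError` when nothing matches mirrors `list.index`, so the container can be swapped in where a list was used before; `indexes` is a generator so that large result sets can be streamed.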
