This repository contains the code and experiments for our paper on Novel Table Search (NTS) over data lakes. Given a query table and a set of unionable candidate tables, NTS aims to find tables that are not only unionable but also novel, i.e., they add new, non-redundant information instead of repeating what is already in the query or in other selected tables. We formalize NTS, propose a syntactic novelty scoring function nscore that penalizes duplicate content, and show that optimizing it exactly is NP-hard. To address this, we introduce our main approximation algorithm ANTS, along with three additional methods (GMC, ERNov, and SemNov) that capture syntactic or semantic novelty in different ways. The repo includes implementations of all methods, scripts to reproduce our experiments on recent unionable-table benchmarks (e.g., Ugen-v2), and code for our downstream ML task where we show that novelty-aware table selection can improve model performance.
For detailed technical information about the algorithms, metrics, and experimental setup, see NTS_Technical_Report.pdf.
Search Systems: Seven implementations of novelty-aware table search algorithms:
- ANTS (Penalized_search.py) - Penalizes redundant tables using diversity-aware scoring
- GMC (GMC_search.py) - Graph-based matching with column-level similarity
- Starmie (Starmie_search.py) - Baseline contrastive learning approach
- SemNov (Semantic_Novelty_search.py) - Semantic novelty maximization
- StarmieGold (Starmie0_search.py) - Enhanced Starmie with ground truth alignment
- StarmieOne (Starmie1_search.py) - Single-alignment Starmie variant
- ER (entitymatching/) - Entity resolution-based table search with ScaNN blocking
Evaluation Metrics: We introduce three metrics for measuring syntactic novelty in search results:
- SNM (Syntactic Novelty Measure) – used for systems that output ranked results (ANTS, SemNov, ER).
- SSNM (Simplified SNM) – a simplified variant of SNM for systems without ranking (GMC).
- N-score (Novelty Score) – computes the novelty of the returned results, enabling a direct comparison of how effectively different systems achieve novelty.
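The precise definitions of these metrics are in the technical report. As a rough intuition, syntactic novelty rewards cell values that are not already present in the query or in previously selected results. The following is a purely illustrative sketch of that idea, not the actual nscore/SNM formula (which is alignment-aware and weighted by alpha/beta):

```python
def novelty_fraction(candidate_rows, seen_values):
    """Illustrative only: fraction of candidate cell values not in `seen_values`.

    The real nscore definition (see NTS_Technical_Report.pdf) works over
    aligned columns and weights unionability vs. novelty; this sketch just
    conveys the duplicate-penalizing intuition.
    """
    cells = [v for row in candidate_rows for v in row]
    if not cells:
        return 0.0
    novel = sum(1 for v in cells if v not in seen_values)
    return novel / len(cells)

# Toy example: "carol" and 41 are new; "alice" and 30 repeat the query.
query = [("alice", 30), ("bob", 25)]
candidate = [("alice", 30), ("carol", 41)]
seen = {v for row in query for v in row}
print(novelty_fraction(candidate, seen))  # 0.5
```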
Data Processing and Evaluation Pipeline:
- Alignment Generation (preprocess_align_parallelize.py) - Automatic column alignment using COP-KMeans clustering
- Evaluation Tools (evaluate_novelty_plus.py) - Configurable N-score computation
Experimental Frameworks:
- Entity Matching (entitymatching/) - ScaNN-based blocking, matching, and ranking. This is used in our ER system.
- Initial Search Impact (initial_search_impact/) - Analysis of search composition effects
- Downstream Evaluation (downstream_eval/) - ML-based task evaluation
- Ablation Studies (ablation_study_d_p/) - Parameter sensitivity analysis for domain size and penalty
- Diversity Maximization (GMM/) - A Max–Min diversification baseline with approximation guarantees (https://web.archive.org/web/20170809211944id_/http://www.cs.toronto.edu/~eidan/papers/huristics.pdf), used to compare against the novelty score (N-score).
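For reference, Max–Min diversification is typically solved with the classic greedy scheme: seed with the farthest pair, then repeatedly add the candidate whose minimum distance to the already-selected set is largest. A toy sketch over Euclidean vectors (the GMM/ code operates on table representations, and its distance function may differ):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_max_min(items, k):
    """Greedy Max-Min diversification over a list of vectors.

    Seeds with the farthest pair, then grows the selection by the point
    that maximizes the minimum distance to everything selected so far.
    Returns the indices of the k selected items.
    """
    # Seed: the pair of items at maximum distance from each other.
    best = max(
        ((i, j) for i in range(len(items)) for j in range(i + 1, len(items))),
        key=lambda p: euclidean(items[p[0]], items[p[1]]),
    )
    selected = list(best)
    while len(selected) < k:
        rest = [i for i in range(len(items)) if i not in selected]
        nxt = max(
            rest,
            key=lambda i: min(euclidean(items[i], items[j]) for j in selected),
        )
        selected.append(nxt)
    return selected

# (0,0), (10,0), (0,10) are mutually far apart; (1,1) is near (0,0).
print(greedy_max_min([(0, 0), (10, 0), (0, 10), (1, 1)], 3))  # [1, 2, 0]
```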
The repository also includes the code developed for constructing the customized benchmark datasets and performing data dilution. The datasets, along with various data representations and ground truth files, are shared at the following link: https://zenodo.org/records/15350355
See requirements.txt for a complete list of Python dependencies.
To install all dependencies:
pip install -r requirements.txt
Key Requirements:
- Python 3.8+
- PyTorch 1.10+
- scikit-learn
- transformers
- sentence-transformers
- pandas, numpy, scipy
- xgboost, lightgbm
- matplotlib, seaborn
NTS/
├── Penalized_search.py # ANTS system implementation
├── GMC_search.py # GMC system implementation
├── Starmie_search.py # Starmie baseline search
├── Starmie0_search.py # StarmieGold system
├── Starmie1_search.py # StarmieOne system
├── Semantic_Novelty_search.py # SemNov system implementation
├── evaluate_novelty_plus.py # Evaluation metrics (SNM, SSNM, N-Score)
├── preprocess_align_parallelize.py # Automatic column alignment generation
├── utilities.py # Shared utility functions
├── config/ # Configuration files for evaluation
├── entitymatching/ # ScaNN-based blocking, matching, and ranking pipeline
├── initial_search_impact/ # Initial search composition impact analysis
├── downstream_eval/ # Downstream task evaluation with ML models
├── ablation_study_d_p/ # Ablation study on domain size (dsize) and penalty degree (p) parameters
├── GMM/ # Greedy Maximum-Minimum diversity maximization
└── data/ # Dataset directory (download from Zenodo)
Additional Resources:
- entitymatching/README.md - Entity matching experiments with ScaNN TF-IDF blocking
- initial_search_impact/README.md - Initial search composition impact simulations
- downstream_eval/README.md - Downstream evaluation experiments guide
- ablation_study_d_p/README.md - Ablation study on domain size and penalty parameters
- GMM/README.md - GMM diversity maximization experiments
All systems require the following data structure in the data/ folder:
- query/ - Query tables
- datalake/ - Candidate unionable tables
- vectors/ - Pre-computed column embeddings (see below)
- diveristy_data/search_results/Starmie/ - Initial Starmie search results
- Alignment files (manual or CL_KMEANS generated)
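A quick way to sanity-check this layout before running anything is a small script like the one below (the base path is illustrative; note the repository's "diveristy_data" spelling):

```python
import os

def missing_dirs(base):
    """Return the expected data/ subfolders (per the layout above) that are absent."""
    expected = [
        "query",
        "datalake",
        "vectors",
        os.path.join("diveristy_data", "search_results", "Starmie"),
    ]
    return [d for d in expected if not os.path.isdir(os.path.join(base, d))]

# e.g. missing_dirs("data/ugen_v2/ugenv2_small") == [] means the layout is complete.
```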
Before running the search systems, you must first:
- Generate column embeddings using Starmie: https://github.com/megagonlabs/starmie
- Extract table embeddings using TABBIE: https://github.com/SFIG611/tabbie
Before running the search systems, you need to create column alignments between query tables and data lake tables. These alignments identify which columns across different tables are semantically equivalent. They can be provided as ground truth or generated automatically by a column-alignment system.
Use preprocess_align_parallelize.py to automatically generate alignments using COP-KMeans clustering with column embeddings:
Edit configuration in the script (lines 65-89):
benchmark_name = "table-union-search-benchmark" # Dataset name
embedding_type = "roberta_serialized" # Embedding type to use
# Configure paths
dl_table_folder = "data/table-union-search-benchmark/small/datalake"
query_table_folder = "data/table-union-search-benchmark/small/query"
groundtruth_file = "data/table-union-search-benchmark/small/tus_small_noverlap_groundtruth_all.csv"
output_file = "data/table-union-search-benchmark/small/CL_KMEANS_cosine_alignment_tus_benchmark_small_all.csv"
Run the alignment generation:
python preprocess_align_parallelize.py
Key Features:
- Parallelized Processing: Uses multiprocessing to handle multiple query tables simultaneously
- Constraint-based Clustering: COP-KMeans prevents columns from the same table being clustered together
- Multiple Embeddings: Supports BERT, RoBERTa, Sentence-BERT, GloVe, FastText, and Starmie embeddings
- Automatic Validation: Verifies ground truth consistency before processing
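The constraint-based clustering above can be pictured as cannot-link pairs: every pair of columns drawn from the same table is forbidden from sharing a cluster (and hence from being aligned with each other). An illustrative construction of such pairs, with hypothetical column identifiers:

```python
from itertools import combinations

def cannot_link_pairs(table_columns):
    """Build COP-KMeans-style cannot-link constraints.

    `table_columns` maps a table name to its list of column identifiers;
    any two columns from the same table must not end up in the same cluster.
    """
    pairs = []
    for cols in table_columns.values():
        pairs.extend(combinations(cols, 2))
    return pairs

cols = {"query1.csv": ["q1.c0", "q1.c1"], "table1.csv": ["t1.c0", "t1.c1"]}
print(cannot_link_pairs(cols))  # [('q1.c0', 'q1.c1'), ('t1.c0', 't1.c1')]
```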
Output: Creates CSV file with alignments:
query_table_name,query_column,query_column#,dl_table_name,dl_column#,dl_column
query1.csv,name,0,table1.csv,0,person_name
query1.csv,age,1,table1.csv,1,age
...
Processing Time: Varies by dataset size; uses all available CPU cores for parallelization.
Manual alignments used in this paper are available at: https://zenodo.org/records/15350355.
These human-curated alignments are typically more accurate but require manual effort to create.
In principle, Starmie should be run on a data lake containing both unionable and non-unionable tables. However, to limit confounding factors in this work, we restrict our experiments to unionable tables only. Our rerankers operate on top of an initial search result; for consistency, we run all systems — including the rerankers — on the same set of unionable tables, specifically the top-20 candidates returned by Starmie.
Starmie generates the initial top-k search results that other systems re-rank.
python Starmie_search.py \
--encoder cl \
--benchmark ugen_v2/ugenv2_small \
--augment_op drop_col \
--sample_meth tfidf_entity \
--table_order column \
--K 20 \
--threshold 0.6 \
--restrict 1
Parameters:
- --encoder: Embedding type (cl, sato, sherlock, tapex)
- --benchmark: Dataset path (e.g., santos/small, ugen_v2/ugenv2_small, table-union-search-benchmark/small)
- --K: Number of top results to return
- --restrict: 1 = use only unionable tables from ground truth
Output: Creates pickle files with Starmie results needed by re-ranking systems.
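The pickle files can be inspected with the standard library; their internal structure depends on the Starmie run (the path below is hypothetical):

```python
import pickle

def load_starmie_results(path):
    """Open a Starmie result pickle; the internal structure depends on the run."""
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g. results = load_starmie_results("data/ugen_v2/ugenv2_small/<starmie_output>.pkl")
```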
After running Starmie, you can run any of the following re-ranking methods. Each system requires editing configuration variables in the Python file before running.
Edit configuration in Penalized_search.py (lines 382-454):
dataFolder="data/ugen_v2/ugenv2_small" # Choose dataset
alignment_file_name="ugenv2_small_manual_alignment_all.csv"
p_degree=1 # Penalty degree
dsize=20 # Domain size
python Penalized_search.py
Output: data/{dataset}/diveristy_data/search_results/Penalized/search_result_penalize_*.csv
Edit configuration in GMC_search.py (lines 1880-1950):
benchmark_path="data/ugen_v2/ugenv2_small/"
alignment_Dust=benchmark_path+"CL_KMEANS_cosine_alignment_ugenv2_small_all.csv"
domainsize=20
python GMC_search.py
Output: data/{dataset}/diveristy_data/search_results/GMC/gmc_results_*.csv
Edit configuration in Semantic_Novelty_search.py (lines 322-441):
bpath="data/ugen_v2/ugenv2_small/"
alignment_Dust=f"{bpath}CL_KMEANS_cosine_alignment_ugenv2_small_all.csv"
Note: Requires TABBIE vectors in {bpath}TABBIE_vectors/
# Penalization degree; initialized in the body of the class to 1
python Semantic_Novelty_search.py
Output: data/{dataset}/diveristy_data/search_results/semanticNovelty/search_result_semNovelty_*.csv
Edit configuration in Starmie0_search.py (lines 393-465):
dataFolder="data/santos/small"
alignment_file_name="Manual_Alignment_4gtruth_santos_small_all.csv"
python Starmie0_search.py
Output: data/{dataset}/diveristy_data/search_results/starmie0/search_result_starmie0_*.csv
Edit configuration in Starmie1_search.py (lines 386-468):
dataFolder="data/santos/small"
alignment_file_name="Manual_Alignment_4gtruth_santos_small_all.csv"
python Starmie1_search.py
Output: data/{dataset}/diveristy_data/search_results/starmie1/search_result_starmie1_*.csv
The easiest way to compute N-Score metrics is using the configuration file:
- Edit the configuration file
config/nscore_config.yaml:
global:
alpha: 1.0
beta: 0.9
use_dynamic_beta: false # Use data-driven beta per query
use_global_average_beta: false # Use single averaged beta across queries
use_per_column_beta: true # Use separate beta per column
use_nonzero_count: true
Q_N: 19 # Number of queries to process
Ks: [2, 3, 5, 10] # K values to evaluate
alignments_file: "data/ugen_v2/ugenv2_small/ugenv2_small_manual_alignment_all.csv"
query_path: "data/ugen_v2/ugenv2_small/query"
table_path: "data/ugen_v2/ugenv2_small/datalake"
methods:
Penalized:
enabled: true
diversity_data_path: "data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/"
result_file: "data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv"
- Run the evaluation:
python evaluate_novelty_plus.py --config config/nscore_config.yaml
Configuration Parameters:
- alpha: Weight for unionability score (typically 1.0)
- beta: Weight for novelty penalty (0.0-1.0; higher = more novelty emphasis)
- use_dynamic_beta: Compute beta per query based on data distribution
- use_global_average_beta: Use a single averaged beta across all queries
- use_per_column_beta: Compute a separate beta for each column
- Q_N: Number of queries to process (0 = all queries)
- Ks: List of k values to evaluate
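Since three of the flags select a beta mode, a small guard against enabling more than one at once may be useful. Treating the modes as mutually exclusive is our assumption, inferred from the config comments:

```python
def beta_modes_consistent(global_cfg):
    """True if at most one beta-mode flag is enabled.

    `global_cfg` is the parsed "global" section of nscore_config.yaml
    (e.g. from yaml.safe_load). Mutual exclusivity of the three modes
    is an assumption, not documented behavior.
    """
    flags = ("use_dynamic_beta", "use_global_average_beta", "use_per_column_beta")
    return sum(bool(global_cfg.get(f)) for f in flags) <= 1
```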
For programmatic use, you can call the functions directly:
from evaluate_novelty_plus import (
compute_syntactic_novelty_measure,
compute_syntactic_novelty_measure_simplified,
nscore_result
)
# Compute SNM
compute_syntactic_novelty_measure(
groundtruth_file="data/ugen_v2/ugenv2_small/ugenv2_small_unionable_groundtruth_diluted.pickle",
search_result="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
snm_avg_result_path_="output/snm_avg.csv",
snm_whole_result_path_="output/snm_whole.csv",
remove_duplicate=0
)
# Compute SSNM (Simplified version)
compute_syntactic_novelty_measure_simplified(
groundtruth_file="data/ugen_v2/ugenv2_small/ugenv2_small_unionable_groundtruth_diluted.pickle",
search_result="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
snm_avg_result_path_="output/ssnm_avg.csv",
snm_whole_result_path_="output/ssnm_whole.csv",
remove_dup=0
)
# Alternative N-Score Computation
nscore_result(
output_folder_by_query="output/nscore_per_query/",
result_file="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
output_file="output/nscore_results.csv",
alignments_file="data/ugen_v2/ugenv2_small/ugenv2_small_manual_alignment_all.csv",
query_path_="data/ugen_v2/ugenv2_small/query",
table_path_="data/ugen_v2/ugenv2_small/datalake",
alph=1.0, # Alpha parameter
beta=0.9, # Beta parameter
Ks={2, 5, 10}, # K values to evaluate
Q_N=10, # Number of queries to process (0 = all)
normalized=0,
use_dynamic_beta=False
)
If you use this code in your research, please cite:
@article{nts2026,
title={Novelty-Aware Union Search for Data Lakes},
author={Kassaie, Besat and Miller, Renee},
year={2026}
}