This repository contains the code and experiments for our paper on Novel Table Search (NTS) over data lakes. Given a query table and a set of unionable candidate tables, NTS aims to find tables that are not only unionable but also novel, i.e., they add new, non-redundant information instead of repeating what is already in the query or in other selected tables. We formalize NTS, propose a syntactic novelty scoring function nscore that penalizes duplicate content, and show that optimizing it exactly is NP-hard. To address this, we introduce our main approximation algorithm ANTS, along with three additional methods (GMC, ERNov, and SemNov) that capture syntactic or semantic novelty in different ways. The repo includes implementations of all methods, scripts to reproduce our experiments on recent unionable-table benchmarks (e.g., Ugen-v2), and code for our downstream ML task where we show that novelty-aware table selection can improve model performance.
For detailed technical information about the algorithms, metrics, and experimental setup, see NTS_Technical_Report.pdf.
Search Systems: Seven implementations of novelty-aware table search algorithms:
- ANTS (Penalized_search.py) - Penalizes redundant tables using diversity-aware scoring
- GMC (GMC_search.py) - Graph-based matching with column-level similarity
- Starmie (Starmie_search.py) - Baseline contrastive learning approach
- SemNov (Semantic_Novelty_search.py) - Semantic novelty maximization
- StarmieGold (Starmie0_search.py) - Enhanced Starmie with ground truth alignment
- StarmieOne (Starmie1_search.py) - Single-alignment Starmie variant
- ER (entitymatching/) - Entity resolution-based table search with ScaNN blocking
Evaluation Metrics: We introduce three metrics for measuring syntactic novelty in search results:
- SNM (Syntactic Novelty Measure) – used for systems that output ranked results (ANTS, SemNov, ER).
- SSNM (Simplified SNM) – a simplified variant of SNM for systems without ranking (GMC).
- N-score (Novelty Score) – computes the novelty of the returned results, enabling a direct comparison of how effectively different systems achieve novelty.
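The precise definitions of these metrics are in the technical report. As a rough intuition, syntactic novelty rewards cell values that are not already present in the query or in previously selected results. The following is a purely illustrative sketch of that idea, not the actual nscore/SNM formula (which is alignment-aware and weighted by alpha/beta):

```python
def novelty_fraction(candidate_rows, seen_values):
    """Illustrative only: fraction of candidate cell values not in `seen_values`.

    The real nscore definition (see NTS_Technical_Report.pdf) works over
    aligned columns and weights unionability vs. novelty; this sketch just
    conveys the duplicate-penalizing intuition.
    """
    cells = [v for row in candidate_rows for v in row]
    if not cells:
        return 0.0
    novel = sum(1 for v in cells if v not in seen_values)
    return novel / len(cells)

# Toy example: "carol" and 41 are new; "alice" and 30 repeat the query.
query = [("alice", 30), ("bob", 25)]
candidate = [("alice", 30), ("carol", 41)]
seen = {v for row in query for v in row}
print(novelty_fraction(candidate, seen))  # 0.5
```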
Data Processing and Evaluation Pipeline:
- Alignment Generation (preprocess_align_parallelize.py) - Automatic column alignment using COP-KMeans clustering
- Evaluation Tools (evaluate_novelty_plus.py) - Configurable N-score computation
Experimental Frameworks:
- Entity Matching (entitymatching/) - ScaNN-based blocking, matching, and ranking. This is used in our ER system.
- Initial Search Impact (initial_search_impact/) - Analysis of search composition effects
- Downstream Evaluation (downstream_eval/) - ML-based task evaluation
- Ablation Studies (ablation_study_d_p/) - Parameter sensitivity analysis for domain size and penalty
- Diversity Maximization (GMM/) - A Max–Min diversification baseline with approximation guarantees (https://web.archive.org/web/20170809211944id_/http://www.cs.toronto.edu/~eidan/papers/huristics.pdf), used to compare against the novelty score (N-score).
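For reference, Max–Min diversification is typically solved with the classic greedy scheme: seed with the farthest pair, then repeatedly add the candidate whose minimum distance to the already-selected set is largest. A toy sketch over Euclidean vectors (the GMM/ code operates on table representations, and its distance function may differ):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_max_min(items, k):
    """Greedy Max-Min diversification over a list of vectors.

    Seeds with the farthest pair, then grows the selection by the point
    that maximizes the minimum distance to everything selected so far.
    Returns the indices of the k selected items.
    """
    # Seed: the pair of items at maximum distance from each other.
    best = max(
        ((i, j) for i in range(len(items)) for j in range(i + 1, len(items))),
        key=lambda p: euclidean(items[p[0]], items[p[1]]),
    )
    selected = list(best)
    while len(selected) < k:
        rest = [i for i in range(len(items)) if i not in selected]
        nxt = max(
            rest,
            key=lambda i: min(euclidean(items[i], items[j]) for j in selected),
        )
        selected.append(nxt)
    return selected

# (0,0), (10,0), (0,10) are mutually far apart; (1,1) is near (0,0).
print(greedy_max_min([(0, 0), (10, 0), (0, 10), (1, 1)], 3))  # [1, 2, 0]
```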
The repository also includes the code developed for constructing the customized benchmark datasets and performing data dilution. The datasets, along with various data representations and ground truth files, are shared at the following link: https://zenodo.org/records/15350355
See requirements.txt for a complete list of Python dependencies.
To install all dependencies:
pip install -r requirements.txt
Key Requirements:
- Python 3.8+
- PyTorch 1.10+
- scikit-learn
- transformers
- sentence-transformers
- pandas, numpy, scipy
- xgboost, lightgbm
- matplotlib, seaborn
NTS/
├── Penalized_search.py # ANTS system implementation
├── GMC_search.py # GMC system implementation
├── Starmie_search.py # Starmie baseline search
├── Starmie0_search.py # StarmieGold system
├── Starmie1_search.py # StarmieOne system
├── Semantic_Novelty_search.py # SemNov system implementation
├── evaluate_novelty_plus.py # Evaluation metrics (SNM, SSNM, N-Score)
├── preprocess_align_parallelize.py # Automatic column alignment generation
├── utilities.py # Shared utility functions
├── config/ # Configuration files for evaluation
├── entitymatching/ # ScaNN-based blocking, matching, and ranking pipeline
├── initial_search_impact/ # Initial search composition impact analysis
├── downstream_eval/ # Downstream task evaluation with ML models
├── ablation_study_d_p/ # Ablation study on domain size (dsize) and penalty degree (p) parameters
├── GMM/ # Greedy Maximum-Minimum diversity maximization
└── data/ # Dataset directory (download from Zenodo)
Additional Resources:
- entitymatching/README.md - Entity matching experiments with ScaNN TF-IDF blocking
- initial_search_impact/README.md - Initial search composition impact simulations
- downstream_eval/README.md - Downstream evaluation experiments guide
- ablation_study_d_p/README.md - Ablation study on domain size and penalty parameters
- GMM/README.md - GMM diversity maximization experiments
All systems require the following data structure in the data/ folder:
- query/ - Query tables
- datalake/ - Candidate unionable tables
- vectors/ - Pre-computed column embeddings (see below)
- diveristy_data/search_results/Starmie/ - Initial Starmie search results
- Alignment files (manual or CL_KMEANS generated)
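A quick way to sanity-check this layout before running anything is a small script like the one below (the base path is illustrative; note the repository's "diveristy_data" spelling):

```python
import os

def missing_dirs(base):
    """Return the expected data/ subfolders (per the layout above) that are absent."""
    expected = [
        "query",
        "datalake",
        "vectors",
        os.path.join("diveristy_data", "search_results", "Starmie"),
    ]
    return [d for d in expected if not os.path.isdir(os.path.join(base, d))]

# e.g. missing_dirs("data/ugen_v2/ugenv2_small") == [] means the layout is complete.
```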
Before running the search systems, you must first:
- Generate column embeddings using Starmie: https://github.com/megagonlabs/starmie
- Extract table embeddings using TABBIE: https://github.com/SFIG611/tabbie
Before running the search systems, you need to create column alignments between query tables and data lake tables. These alignments identify which columns across different tables are semantically equivalent. They can be provided as ground truth or generated automatically by a column-alignment system.
Use preprocess_align_parallelize.py to automatically generate alignments using COP-KMeans clustering with column embeddings:
Edit configuration in the script (lines 65-89):
benchmark_name = "table-union-search-benchmark" # Dataset name
embedding_type = "roberta_serialized" # Embedding type to use
# Configure paths
dl_table_folder = "data/table-union-search-benchmark/small/datalake"
query_table_folder = "data/table-union-search-benchmark/small/query"
groundtruth_file = "data/table-union-search-benchmark/small/tus_small_noverlap_groundtruth_all.csv"
output_file = "data/table-union-search-benchmark/small/CL_KMEANS_cosine_alignment_tus_benchmark_small_all.csv"
Run the alignment generation:
python preprocess_align_parallelize.py
Key Features:
- Parallelized Processing: Uses multiprocessing to handle multiple query tables simultaneously
- Constraint-based Clustering: COP-KMeans prevents columns from the same table being clustered together
- Multiple Embeddings: Supports BERT, RoBERTa, Sentence-BERT, GloVe, FastText, and Starmie embeddings
- Automatic Validation: Verifies ground truth consistency before processing
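The constraint-based clustering above can be pictured as cannot-link pairs: every pair of columns drawn from the same table is forbidden from sharing a cluster (and hence from being aligned with each other). An illustrative construction of such pairs, with hypothetical column identifiers:

```python
from itertools import combinations

def cannot_link_pairs(table_columns):
    """Build COP-KMeans-style cannot-link constraints.

    `table_columns` maps a table name to its list of column identifiers;
    any two columns from the same table must not end up in the same cluster.
    """
    pairs = []
    for cols in table_columns.values():
        pairs.extend(combinations(cols, 2))
    return pairs

cols = {"query1.csv": ["q1.c0", "q1.c1"], "table1.csv": ["t1.c0", "t1.c1"]}
print(cannot_link_pairs(cols))  # [('q1.c0', 'q1.c1'), ('t1.c0', 't1.c1')]
```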
Output: Creates CSV file with alignments:
query_table_name,query_column,query_column#,dl_table_name,dl_column#,dl_column
query1.csv,name,0,table1.csv,0,person_name
query1.csv,age,1,table1.csv,1,age
...
Processing Time: Varies by dataset size; uses all available CPU cores for parallelization.
Manual alignments used in this paper are available at: https://zenodo.org/records/15350355.
These human-curated alignments are typically more accurate but require manual effort to create.
In principle, Starmie should be run on a data lake containing both unionable and non-unionable tables. However, to limit confounding factors in this work, we restrict our experiments to unionable tables only. Our rerankers operate on top of an initial search result; for consistency, we run all systems — including the rerankers — on the same set of unionable tables, specifically the top-20 candidates returned by Starmie.
Starmie generates the initial top-k search results that other systems re-rank.
python Starmie_search.py \
--encoder cl \
--benchmark ugen_v2/ugenv2_small \
--augment_op drop_col \
--sample_meth tfidf_entity \
--table_order column \
--K 20 \
--threshold 0.6 \
--restrict 1
Parameters:
- --encoder: Embedding type (cl, sato, sherlock, tapex)
- --benchmark: Dataset path (e.g., santos/small, ugen_v2/ugenv2_small, table-union-search-benchmark/small)
- --K: Number of top results to return
- --restrict: 1 = use only unionable tables from ground truth
Output: Creates pickle files with Starmie results needed by re-ranking systems.
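The pickle files can be inspected with the standard library; their internal structure depends on the Starmie run (the path below is hypothetical):

```python
import pickle

def load_starmie_results(path):
    """Open a Starmie result pickle; the internal structure depends on the run."""
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g. results = load_starmie_results("data/ugen_v2/ugenv2_small/<starmie_output>.pkl")
```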
After running Starmie, you can run any of the following re-ranking methods. Each system requires editing configuration variables in the Python file before running.
Edit configuration in Penalized_search.py (lines 382-454):
dataFolder="data/ugen_v2/ugenv2_small" # Choose dataset
alignment_file_name="ugenv2_small_manual_alignment_all.csv"
p_degree=1 # Penalty degree
dsize=20 # Domain size
python Penalized_search.py
Output: data/{dataset}/diveristy_data/search_results/Penalized/search_result_penalize_*.csv
Edit configuration in GMC_search.py (lines 1880-1950):
benchmark_path="data/ugen_v2/ugenv2_small/"
alignment_Dust=benchmark_path+"CL_KMEANS_cosine_alignment_ugenv2_small_all.csv"
domainsize=20
python GMC_search.py
Output: data/{dataset}/diveristy_data/search_results/GMC/gmc_results_*.csv
Edit configuration in Semantic_Novelty_search.py (lines 322-441):
bpath="data/ugen_v2/ugenv2_small/"
alignment_Dust=f"{bpath}CL_KMEANS_cosine_alignment_ugenv2_small_all.csv"
Note: Requires TABBIE vectors in {bpath}TABBIE_vectors/
# Penalization degree; initialized in the body of the class to 1
python Semantic_Novelty_search.py
Output: data/{dataset}/diveristy_data/search_results/semanticNovelty/search_result_semNovelty_*.csv
Edit configuration in Starmie0_search.py (lines 393-465):
dataFolder="data/santos/small"
alignment_file_name="Manual_Alignment_4gtruth_santos_small_all.csv"
python Starmie0_search.py
Output: data/{dataset}/diveristy_data/search_results/starmie0/search_result_starmie0_*.csv
Edit configuration in Starmie1_search.py (lines 386-468):
dataFolder="data/santos/small"
alignment_file_name="Manual_Alignment_4gtruth_santos_small_all.csv"
python Starmie1_search.py
Output: data/{dataset}/diveristy_data/search_results/starmie1/search_result_starmie1_*.csv
The easiest way to compute N-Score metrics is using the configuration file:
- Edit the configuration file
config/nscore_config.yaml:
global:
alpha: 1.0
beta: 0.9
use_dynamic_beta: false # Use data-driven beta per query
use_global_average_beta: false # Use single averaged beta across queries
use_per_column_beta: true # Use separate beta per column
use_nonzero_count: true
Q_N: 19 # Number of queries to process
Ks: [2, 3, 5, 10] # K values to evaluate
alignments_file: "data/ugen_v2/ugenv2_small/ugenv2_small_manual_alignment_all.csv"
query_path: "data/ugen_v2/ugenv2_small/query"
table_path: "data/ugen_v2/ugenv2_small/datalake"
methods:
Penalized:
enabled: true
diversity_data_path: "data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/"
result_file: "data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv"
- Run the evaluation:
python evaluate_novelty_plus.py --config config/nscore_config.yaml
Configuration Parameters:
- alpha: Weight for unionability score (typically 1.0)
- beta: Weight for novelty penalty (0.0-1.0; higher = more novelty emphasis)
- use_dynamic_beta: Compute beta per query based on data distribution
- use_global_average_beta: Use a single averaged beta across all queries
- use_per_column_beta: Compute a separate beta for each column
- Q_N: Number of queries to process (0 = all queries)
- Ks: List of k values to evaluate
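Since three of the flags select a beta mode, a small guard against enabling more than one at once may be useful. Treating the modes as mutually exclusive is our assumption, inferred from the config comments:

```python
def beta_modes_consistent(global_cfg):
    """True if at most one beta-mode flag is enabled.

    `global_cfg` is the parsed "global" section of nscore_config.yaml
    (e.g. from yaml.safe_load). Mutual exclusivity of the three modes
    is an assumption, not documented behavior.
    """
    flags = ("use_dynamic_beta", "use_global_average_beta", "use_per_column_beta")
    return sum(bool(global_cfg.get(f)) for f in flags) <= 1
```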
For programmatic use, you can call the functions directly:
from evaluate_novelty_plus import (
compute_syntactic_novelty_measure,
compute_syntactic_novelty_measure_simplified,
nscore_result
)
# Compute SNM
compute_syntactic_novelty_measure(
groundtruth_file="data/ugen_v2/ugenv2_small/ugenv2_small_unionable_groundtruth_diluted.pickle",
search_result="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
snm_avg_result_path_="output/snm_avg.csv",
snm_whole_result_path_="output/snm_whole.csv",
remove_duplicate=0
)
# Compute SSNM (Simplified version)
compute_syntactic_novelty_measure_simplified(
groundtruth_file="data/ugen_v2/ugenv2_small/ugenv2_small_unionable_groundtruth_diluted.pickle",
search_result="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
snm_avg_result_path_="output/ssnm_avg.csv",
snm_whole_result_path_="output/ssnm_whole.csv",
remove_dup=0
)
# Alternative N-Score Computation
nscore_result(
output_folder_by_query="output/nscore_per_query/",
result_file="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
output_file="output/nscore_results.csv",
alignments_file="data/ugen_v2/ugenv2_small/ugenv2_small_manual_alignment_all.csv",
query_path_="data/ugen_v2/ugenv2_small/query",
table_path_="data/ugen_v2/ugenv2_small/datalake",
alph=1.0, # Alpha parameter
beta=0.9, # Beta parameter
Ks={2, 5, 10}, # K values to evaluate
Q_N=10, # Number of queries to process (0 = all)
normalized=0,
use_dynamic_beta=False
)
If you use this code in your research, please cite:
@article{nts2026,
title={Novelty-Aware Union Search for Data Lakes},
author={Kassaie, Besat and Miller, Renee},
year={2026}
}