
Novel Table Search (NTS)

Overview

This repository contains the code and experiments for our paper on Novel Table Search (NTS) over data lakes. Given a query table and a set of unionable candidate tables, NTS aims to find tables that are not only unionable but also novel, i.e., they add new, non-redundant information instead of repeating what is already in the query or in other selected tables. We formalize NTS, propose a syntactic novelty scoring function nscore that penalizes duplicate content, and show that optimizing it exactly is NP-hard. To address this, we introduce our main approximation algorithm ANTS, along with three additional methods (GMC, ERNov, and SemNov) that capture syntactic or semantic novelty in different ways. The repo includes implementations of all methods, scripts to reproduce our experiments on recent unionable-table benchmarks (e.g., Ugen-v2), and code for our downstream ML task where we show that novelty-aware table selection can improve model performance.

Technical Report

For detailed technical information about the algorithms, metrics, and experimental setup, see NTS_Technical_Report.pdf.

Core Components

Search Systems: Seven implementations of novelty-aware table search algorithms:

  • ANTs (Penalized_search.py) - Penalizes redundant tables using diversity-aware scoring
  • GMC (GMC_search.py) - Graph-based matching with column-level similarity
  • Starmie (Starmie_search.py) - Baseline contrastive learning approach
  • SemNov (Semantic_Novelty_search.py) - Semantic novelty maximization
  • StarmieGold (Starmie0_search.py) - Enhanced Starmie with ground truth alignment
  • StarmieOne (Starmie1_search.py) - Single-alignment Starmie variant
  • ER (entitymatching/) - Entity resolution-based table search with ScaNN blocking

Evaluation Metrics: We introduce three metrics for measuring syntactic novelty in search results:

  • SNM (Syntactic Novelty Measure) – used for systems that output ranked results (ANTS, SemNov, ER).
  • SSNM (Simplified SNM) – a simplified variant of SNM for systems without ranking (GMC).
  • N-score (Novelty Score) – computes the novelty of the returned results, enabling a direct comparison of how effectively different systems achieve novelty.
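As a rough intuition for what these metrics reward (this is an illustrative toy, not the exact nscore/SNM definitions from the paper, which operate on aligned columns and apply a duplicate penalty), syntactic novelty can be thought of as the fraction of a candidate table's cell values not already covered by the query or by previously selected tables:

```python
# Illustrative sketch only: a toy syntactic-novelty ratio, NOT the paper's
# nscore/SNM. It just measures how many candidate values are unseen so far.
def toy_novelty(candidate_values, seen_values):
    """Fraction of candidate cell values not already seen."""
    candidate = set(candidate_values)
    if not candidate:
        return 0.0
    return len(candidate - set(seen_values)) / len(candidate)

seen = {"alice", "bob"}
print(toy_novelty(["alice", "carol", "dave"], seen))  # 2 of 3 values are new
```

A candidate that only repeats already-selected content scores 0 under this toy measure, which is the behavior the real metrics formalize and penalize.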

Data Processing and Evaluation Pipeline:

  • Alignment Generation (preprocess_align_parallelize.py) - Automatic column alignment using COP-KMeans clustering
  • Evaluation Tools (evaluate_novelty_plus.py) - Configurable N-score computation

Experimental Frameworks:

  • Entity Matching (entitymatching/) - ScaNN-based blocking, matching, and ranking. This is used in our ER system.
  • Initial Search Impact (initial_search_impact/) - Analysis of search composition effects
  • Downstream Evaluation (downstream_eval/) - ML-based task evaluation
  • Ablation Studies (ablation_study_d_p/) - Parameter sensitivity analysis for domain size and penalty
  • Diversity Maximization (GMM/) - A Max–Min diversification baseline with approximation guarantees (https://web.archive.org/web/20170809211944id_/http://www.cs.toronto.edu/~eidan/papers/huristics.pdf), used to compare against the novelty score (N-score).

Datasets

The repository also includes the code developed for constructing the customized benchmark datasets and performing data dilution. The datasets, along with various data representations and ground truth files, are shared at the following link: https://zenodo.org/records/15350355

Dependencies

See requirements.txt for a complete list of Python dependencies.

To install all dependencies:

pip install -r requirements.txt

Key Requirements:

  • Python 3.8+
  • PyTorch 1.10+
  • scikit-learn
  • transformers
  • sentence-transformers
  • pandas, numpy, scipy
  • xgboost, lightgbm
  • matplotlib, seaborn

Repository Structure

NTS/
├── Penalized_search.py              # ANTs system implementation
├── GMC_search.py                    # GMC system implementation
├── Starmie_search.py                # Starmie baseline search
├── Starmie0_search.py               # StarmieGold system
├── Starmie1_search.py               # StarmieOne system
├── Semantic_Novelty_search.py       # SemNov system implementation
├── evaluate_novelty_plus.py         # Evaluation metrics (SNM, SSNM, N-Score)
├── preprocess_align_parallelize.py  # Automatic column alignment generation
├── utilities.py                     # Shared utility functions
├── config/                          # Configuration files for evaluation
├── entitymatching/                  # ScaNN-based blocking, matching, and ranking pipeline
├── initial_search_impact/           # Initial search composition impact analysis
├── downstream_eval/                 # Downstream task evaluation with ML models
├── ablation_study_d_p/              # Ablation study on domain size (dsize) and penalty degree (p) parameters
├── GMM/                             # Greedy Maximum-Minimum diversity maximization
└── data/                            # Dataset directory (download from Zenodo)

Running the Search Systems

Prerequisites

All systems require the following data structure in the data/ folder:

  • query/ - Query tables
  • datalake/ - Candidate unionable tables
  • vectors/ - Pre-computed column embeddings (see below)
  • diveristy_data/search_results/Starmie/ - Initial Starmie search results
  • Alignment files (manual or CL_KMEANS generated)

Generating Column and Table Embeddings

Before running the search systems, you must first:

  1. Generate column embeddings using Starmie: https://github.com/megagonlabs/starmie
  2. Extract table embeddings using TABBIE: https://github.com/SFIG611/tabbie

Generating Column Alignments

Before running the search systems, you need to create column alignments between query tables and data lake tables. These alignments identify which columns across different tables are semantically equivalent. They can be provided as ground truth or generated automatically by a column-alignment system.

Automatic Alignment Generation

Use preprocess_align_parallelize.py to automatically generate alignments using COP-KMeans clustering with column embeddings:

Edit configuration in the script (lines 65-89):

benchmark_name = "table-union-search-benchmark"  # Dataset name
embedding_type = "roberta_serialized"  # Embedding type to use

# Configure paths
dl_table_folder = "data/table-union-search-benchmark/small/datalake"
query_table_folder = "data/table-union-search-benchmark/small/query"
groundtruth_file = "data/table-union-search-benchmark/small/tus_small_noverlap_groundtruth_all.csv"
output_file = "data/table-union-search-benchmark/small/CL_KMEANS_cosine_alignment_tus_benchmark_small_all.csv"

Run the alignment generation:

python preprocess_align_parallelize.py

Key Features:

  • Parallelized Processing: Uses multiprocessing to handle multiple query tables simultaneously
  • Constraint-based Clustering: COP-KMeans prevents columns from the same table being clustered together
  • Multiple Embeddings: Supports BERT, RoBERTa, Sentence-BERT, GloVe, FastText, and Starmie embeddings
  • Automatic Validation: Verifies ground truth consistency before processing
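A minimal sketch of the cannot-link idea behind the constraint-based clustering (the data structures here are hypothetical; the actual script derives them from the loaded tables and embeddings):

```python
from itertools import combinations

# Toy input: column IDs grouped by their source table (hypothetical names).
columns_by_table = {
    "query1.csv": ["q1:name", "q1:age"],
    "table1.csv": ["t1:person_name", "t1:age"],
}

# COP-KMeans accepts cannot-link pairs: two columns from the same table must
# not land in the same cluster, since a table rarely repeats an attribute.
cannot_link = [
    pair
    for cols in columns_by_table.values()
    for pair in combinations(cols, 2)
]
print(cannot_link)  # [('q1:name', 'q1:age'), ('t1:person_name', 't1:age')]
```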

Output: Creates CSV file with alignments:

query_table_name,query_column,query_column#,dl_table_name,dl_column#,dl_column
query1.csv,name,0,table1.csv,0,person_name
query1.csv,age,1,table1.csv,1,age
...

Processing Time: Varies by dataset size; uses all available CPU cores for parallelization.
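Given the CSV schema above, downstream code can group alignments per query column with pandas. This is an illustrative sketch; the column names are taken directly from the sample output:

```python
import io
import pandas as pd

# Sample rows in the documented alignment schema (from the example above).
csv_text = """query_table_name,query_column,query_column#,dl_table_name,dl_column#,dl_column
query1.csv,name,0,table1.csv,0,person_name
query1.csv,age,1,table1.csv,1,age
"""
alignments = pd.read_csv(io.StringIO(csv_text))

# Map each query column to its aligned data-lake columns.
for (qtable, qcol), group in alignments.groupby(["query_table_name", "query_column"]):
    print(qtable, qcol, "->", list(zip(group["dl_table_name"], group["dl_column"])))
```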

Manual Alignment

Manual alignments used in this paper are available at: https://zenodo.org/records/15350355.
These human-curated alignments are typically more accurate but require manual effort to create.

Running the Systems

In principle, Starmie should be run on a data lake containing both unionable and non-unionable tables. However, to limit confounding factors in this work, we restrict our experiments to unionable tables only. Our rerankers operate on top of an initial search result; for consistency, we run all systems — including the rerankers — on the same set of unionable tables, specifically the top-20 candidates returned by Starmie.
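Conceptually, each reranker rescores Starmie's top-20 candidates and re-sorts them. A toy sketch of that pattern (the scoring function here is hypothetical; the real systems use their respective novelty scores):

```python
def rerank(starmie_topk, novelty_score):
    """Re-order an initial candidate list by a novelty-aware score.

    starmie_topk: list of (table_name, unionability_score) pairs.
    novelty_score: callable table_name -> float (hypothetical stand-in for
    the per-system novelty scoring described in this repo).
    """
    return sorted(
        starmie_topk,
        key=lambda pair: pair[1] + novelty_score(pair[0]),
        reverse=True,
    )

topk = [("t1.csv", 0.9), ("t2.csv", 0.8)]
# t2.csv moves first: 0.8 + 0.3 = 1.1 beats t1.csv's 0.9.
print(rerank(topk, lambda t: 0.3 if t == "t2.csv" else 0.0))
```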

Step 1: Run Starmie (Baseline Search)

Starmie generates the initial top-k search results that other systems re-rank.

python Starmie_search.py \
  --encoder cl \
  --benchmark ugen_v2/ugenv2_small \
  --augment_op drop_col \
  --sample_meth tfidf_entity \
  --table_order column \
  --K 20 \
  --threshold 0.6 \
  --restrict 1

Parameters:

  • --encoder: Embedding type (cl, sato, sherlock, tapex)
  • --benchmark: Dataset path (e.g., santos/small, ugen_v2/ugenv2_small, table-union-search-benchmark/small)
  • --K: Number of top results to return
  • --restrict: 1 = use only unionable tables from ground truth

Output: Creates pickle files with Starmie results needed by re-ranking systems.

Step 2: Run Re-ranking Systems

After running Starmie, you can run any of the following re-ranking methods. Each system requires editing configuration variables in the Python file before running.

ANTs (Penalized Search)

Edit configuration in Penalized_search.py (lines 382-454):

dataFolder="data/ugen_v2/ugenv2_small"  # Choose dataset
alignment_file_name="ugenv2_small_manual_alignment_all.csv"
p_degree=1  # Penalty degree
dsize=20    # Domain size
python Penalized_search.py

Output: data/{dataset}/diveristy_data/search_results/Penalized/search_result_penalize_*.csv

GMC

Edit configuration in GMC_search.py (lines 1880-1950):

benchmark_path="data/ugen_v2/ugenv2_small/"
alignment_Dust=benchmark_path+"CL_KMEANS_cosine_alignment_ugenv2_small_all.csv"
domainsize=20
python GMC_search.py

Output: data/{dataset}/diveristy_data/search_results/GMC/gmc_results_*.csv

SemNov (Semantic Novelty)

Edit configuration in Semantic_Novelty_search.py (lines 322-441):

bpath="data/ugen_v2/ugenv2_small/"
alignment_Dust=f"{bpath}CL_KMEANS_cosine_alignment_ugenv2_small_all.csv"

Note: Requires TABBIE vectors in {bpath}TABBIE_vectors/. The penalization degree is initialized to 1 in the body of the class.

python Semantic_Novelty_search.py

Output: data/{dataset}/diveristy_data/search_results/semanticNovelty/search_result_semNovelty_*.csv

StarmieGold (Starmie0)

Edit configuration in Starmie0_search.py (lines 393-465):

dataFolder="data/santos/small"
alignment_file_name="Manual_Alignment_4gtruth_santos_small_all.csv"
python Starmie0_search.py

Output: data/{dataset}/diveristy_data/search_results/starmie0/search_result_starmie0_*.csv

StarmieOne (Starmie1)

Edit configuration in Starmie1_search.py (lines 386-468):

dataFolder="data/santos/small"
alignment_file_name="Manual_Alignment_4gtruth_santos_small_all.csv"
python Starmie1_search.py

Output: data/{dataset}/diveristy_data/search_results/starmie1/search_result_starmie1_*.csv

Computing Metrics

Using Configuration File (Recommended)

The easiest way to compute N-Score metrics is using the configuration file:

  1. Edit the configuration file config/nscore_config.yaml:
global:
  alpha: 1.0
  beta: 0.9
  use_dynamic_beta: false  # Use data-driven beta per query
  use_global_average_beta: false  # Use single averaged beta across queries
  use_per_column_beta: true  # Use separate beta per column
  use_nonzero_count: true
  Q_N: 19  # Number of queries to process
  Ks: [2, 3, 5, 10]  # K values to evaluate
  alignments_file: "data/ugen_v2/ugenv2_small/ugenv2_small_manual_alignment_all.csv"
  query_path: "data/ugen_v2/ugenv2_small/query"
  table_path: "data/ugen_v2/ugenv2_small/datalake"

methods:
  Penalized:
    enabled: true
    diversity_data_path: "data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/"
    result_file: "data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv"
  2. Run the evaluation:
python evaluate_novelty_plus.py --config config/nscore_config.yaml

Configuration Parameters:

  • alpha: Weight for unionability score (typically 1.0)
  • beta: Weight for novelty penalty (0.0-1.0, higher = more novelty emphasis)
  • use_dynamic_beta: Compute beta per query based on data distribution
  • use_global_average_beta: Use single averaged beta across all queries
  • use_per_column_beta: Compute separate beta for each column
  • Q_N: Number of queries to process (0 = all queries)
  • Ks: List of k values to evaluate
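The config can be consumed with PyYAML; a minimal sketch of reading the global parameters and the enabled methods (hypothetical code mirroring what evaluate_novelty_plus.py presumably does, shown with an inline, trimmed copy of the example config):

```python
import yaml

# Inline, trimmed copy of the example config (values from the snippet above).
config_text = """
global:
  alpha: 1.0
  beta: 0.9
  Ks: [2, 3, 5, 10]
methods:
  Penalized:
    enabled: true
"""
config = yaml.safe_load(config_text)

params = config["global"]
enabled = [name for name, m in config["methods"].items() if m.get("enabled")]
print(params["alpha"], params["Ks"], enabled)
```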

Using Python API

For programmatic use, you can call the functions directly:

from evaluate_novelty_plus import (
    compute_syntactic_novelty_measure,
    compute_syntactic_novelty_measure_simplified,
    nscore_result
)

# Compute SNM
compute_syntactic_novelty_measure(
    groundtruth_file="data/ugen_v2/ugenv2_small/ugenv2_small_unionable_groundtruth_diluted.pickle",
    search_result="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
    snm_avg_result_path_="output/snm_avg.csv",
    snm_whole_result_path_="output/snm_whole.csv",
    remove_duplicate=0
)

# Compute SSNM (Simplified version)
compute_syntactic_novelty_measure_simplified(
    groundtruth_file="data/ugen_v2/ugenv2_small/ugenv2_small_unionable_groundtruth_diluted.pickle",
    search_result="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
    snm_avg_result_path_="output/ssnm_avg.csv",
    snm_whole_result_path_="output/ssnm_whole.csv",
    remove_dup=0
)

# Alternative N-Score computation
nscore_result(
    output_folder_by_query="output/nscore_per_query/",
    result_file="data/ugen_v2/ugenv2_small/diveristy_data/search_results/Penalized/search_result_penalize_*.csv",
    output_file="output/nscore_results.csv",
    alignments_file="data/ugen_v2/ugenv2_small/ugenv2_small_manual_alignment_all.csv",
    query_path_="data/ugen_v2/ugenv2_small/query",
    table_path_="data/ugen_v2/ugenv2_small/datalake",
    alph=1.0,  # Alpha parameter
    beta=0.9,  # Beta parameter
    Ks={2, 5, 10},  # K values to evaluate
    Q_N=10,  # Number of queries to process (0 = all)
    normalized=0,
    use_dynamic_beta=False
)

Citation

If you use this code in your research, please cite:

@article{nts2026,
  title={Novelty-Aware Union Search for Data Lakes},
  author={Kassaie, Besat and Miller, Renee},
  year={2026}
}
