
refactor(hgraph): move duplicate tracking to graph layer #1797

Merged
inabao merged 3 commits into main from pyramid-duplicate
Apr 9, 2026

Conversation

inabao (Collaborator) commented Apr 3, 2026

Summary

  • move HGraph duplicate tracking from LabelTable into graph-level duplicate trackers and wire support_duplicate through HGraph and Pyramid graph params
  • restore legacy HGraph duplicate index compatibility during deserialize/search, including legacy metadata mapping and duplicate payload recovery for flat and compressed graph storage
  • add deterministic Pyramid duplicate tests that validate path-aware visibility semantics alongside the existing duplicate regression coverage

Testing

  • make release -j4
  • ./build-release/tests/functests "Pyramid Duplicate Path Semantics Same Path"
  • ./build-release/tests/functests "Pyramid Duplicate Path Semantics Prefix Descendant"
  • ./build-release/tests/functests "Pyramid Duplicate Path Semantics Shared Prefix Visibility"
  • ./build-release/tests/functests "Pyramid Duplicate Path Semantics Negative Control"
  • ./build-release/tests/functests "Pyramid Duplicate Test"
  • ./build-release/tests/functests "Pyramid Duplicate ID Test"

inabao (Collaborator, Author) commented Apr 3, 2026

Tracked by #1798

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors duplicate detection and tracking within the HGraph and Pyramid indices. It introduces a new DuplicateInterface along with DenseDuplicateTracker and SparseDuplicateTracker implementations, shifting duplicate management from the LabelTable to the graph data cells. The update includes changes to serialization to support versioning, refactored search logic that leverages the new trackers, and the addition of a hops_limit in Pyramid construction to prevent infinite loops. Feedback includes a suggestion to use the C++ <random> library for better reproducibility in examples and a request to remove unused variables in the Pyramid implementation.

```cpp
ids[i] = i;
paths[i] = "a/b/c";
for (int j = 0; j < DIM; ++j) {
    vectors[i * DIM + j] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
```
gemini-code-assist bot commented (severity: medium)

For better reproducibility and higher-quality random numbers in this example, consider using the C++ <random> library with a fixed seed instead of rand(). The previous version of this file used std::mt19937, which is a good practice to follow.

To do this, you would add #include <random> at the top of the file, and then initialize the generator before the loop:

```cpp
std::mt19937 rng(47);  // Use a fixed seed for deterministic results
std::uniform_real_distribution<> distrib_real;
```

Then, you can replace this line with the suggestion below.

Suggested change:

```cpp
// before
vectors[i * DIM + j] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
// after
vectors[i * DIM + j] = distrib_real(rng);
```

inabao (Collaborator, Author) replied
This example change has been reverted in a later commit, so the PR no longer modifies examples/cpp/107_index_pyramid.cpp.

```cpp
auto inner_id = static_cast<InnerIdType>(i + local_cur_element_count);
const auto* vector = data_vectors + dim_ * data_bias;
int no_build_level_index = 0;
int add_count = 0;
```
gemini-code-assist bot commented (severity: medium)
The variable add_count is initialized here, but its value is never used. This appears to be dead code and can be removed for clarity. The corresponding increment on line 529 should also be removed.

inabao (Collaborator, Author) replied
Removed the unused add_count declaration in a follow-up commit.

```cpp
    no_build_level_index++;
    continue;
}
++add_count;
```
gemini-code-assist bot commented (severity: medium)
This increments the add_count variable, which is unused. This line can be removed along with the variable declaration on line 517.

inabao (Collaborator, Author) replied
Removed the matching ++add_count increment together with the dead variable.

Copilot AI (Contributor) left a comment

Pull request overview

Refactors duplicate-vector tracking by moving duplicate metadata from LabelTable into graph-level duplicate trackers, wiring support_duplicate through HGraph/Pyramid graph parameters, and updating search/serialization plus tests (including legacy HGraph duplicate-format compatibility).

Changes:

  • Introduce graph-level duplicate tracking via a new DuplicateInterface plus dense/sparse tracker implementations, serialized as part of graph storage.
  • Update HGraph/Pyramid to use graph-layer duplicates (and add legacy HGraph duplicate-format recovery on deserialize).
  • Expand/adjust functional + unit tests to cover Pyramid path-aware duplicate visibility and HGraph compressed-graph duplicate serialization.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/test_pyramid.cpp: Adds deterministic Pyramid duplicate path-semantics tests and small dataset helpers.
  • tests/test_hgraph.cpp: Extends duplicate tests to cover compressed graph serialization; refactors duplicate test runner.
  • test_duplicate_plan.md: Adds a written (Chinese) test plan for Pyramid duplicate detection scenarios.
  • src/inner_string_params.h: Adds SUPPORT_DUPLICATE to the default formatting map.
  • src/impl/searcher/parallel_searcher.cpp: Switches duplicate expansion from LabelTable to GraphInterface::GetDuplicateIds; applies attribute filter consistently.
  • src/impl/searcher/basic_searcher.cpp: Switches duplicate expansion from LabelTable to GraphInterface::GetDuplicateIds.
  • src/impl/label_table.h: Removes in-LabelTable duplicate storage and adds a duplicate-tracker hook + legacy-aware deserialize API.
  • src/impl/label_table.cpp: Implements legacy duplicate-format deserialization into an injected duplicate tracker.
  • src/impl/label_table_test.cpp: Removes LabelTable duplicate-specific tests; keeps basic serialize/deserialize count coverage.
  • src/impl/inner_search_param.h: Simplifies copy-assignment to = default (covers new members like hops/duplicate flags).
  • src/datacell/sparse_graph_datacell.h: Provides a sparse duplicate tracker factory for sparse graphs.
  • src/datacell/sparse_graph_datacell_parameter.cpp: Parses/serializes support_duplicate and checks compatibility.
  • src/datacell/sparse_duplicate_tracker.h: Adds sparse duplicate tracker implementation (map-based).
  • src/datacell/sparse_duplicate_tracker.cpp: Implements sparse duplicate tracking + serialization + legacy-format import.
  • src/datacell/sparse_duplicate_tracker_test.cpp: Adds unit tests for sparse duplicate tracker behavior and serialization.
  • src/datacell/graph_interface.h: Adds duplicate tracker lifecycle + duplicate ID APIs; serializes tracker as part of graph header.
  • src/datacell/graph_interface.cpp: Initializes duplicate tracker based on graph params during MakeInstance().
  • src/datacell/graph_interface_parameter.h: Adds support_duplicate_ flag to graph parameter base class.
  • src/datacell/graph_datacell.h: Adds dense duplicate tracker factory; resizes tracker with graph capacity.
  • src/datacell/graph_datacell_test.cpp: Adds a unit test validating duplicate-tracker enable/disable behavior.
  • src/datacell/graph_datacell_parameter.cpp: Parses/serializes support_duplicate and checks compatibility.
  • src/datacell/duplicate_interface.h: Introduces the duplicate-tracker interface used by graph layers.
  • src/datacell/dense_duplicate_tracker.h: Adds dense duplicate tracker implementation (ring/array-based).
  • src/datacell/dense_duplicate_tracker.cpp: Implements dense duplicate tracking + serialization + legacy-format import + resize.
  • src/datacell/dense_duplicate_tracker_test.cpp: Adds unit tests for dense duplicate tracker behavior/serialization/legacy import.
  • src/datacell/compressed_graph_datacell.h: Adds sparse duplicate tracker factory for compressed graphs.
  • src/datacell/compressed_graph_datacell.cpp: Resizes duplicate tracker with compressed graph capacity.
  • src/datacell/compressed_graph_datacell_test.cpp: Adds a unit test validating duplicate tracker enablement for compressed graphs.
  • src/datacell/compressed_graph_datacell_parameter.h: Adds support_duplicate (de)serialization + compatibility checking.
  • src/analyzer/hgraph_analyzer.cpp: Updates analyzer duplicate handling to consult the graph duplicate tracker instead of LabelTable internals.
  • src/algorithm/pyramid.h: Stores support_duplicate_ on the Pyramid instance (no longer tied to LabelTable).
  • src/algorithm/pyramid.cpp: Wires duplicate support through graph params; sets duplicate IDs at the graph layer; initializes the tracker for node graphs.
  • src/algorithm/pyramid_zparameters_test.cpp: Adds a unit test verifying support_duplicate maps into graph params.
  • src/algorithm/hgraph.h: Extends the label-info deserialization signature to handle the legacy duplicate format.
  • src/algorithm/hgraph.cpp: Wires graph-level duplicate tracking + legacy mapping; version-tags the duplicate format in metadata.
  • src/algorithm/hgraph_parameter.cpp: Propagates support_duplicate into bottom_graph_param->support_duplicate_.
  • src/algorithm/hgraph_parameter_test.cpp: Adds a unit test for support_duplicate mapping; modifies the compatibility test block.
  • examples/cpp/107_index_pyramid.cpp: Updates the example to demonstrate Pyramid duplicate detection with a small dataset.

```cpp
auto inner_id = static_cast<InnerIdType>(i + local_cur_element_count);
const auto* vector = data_vectors + dim_ * data_bias;
int no_build_level_index = 0;
int add_count = 0;
```
Copilot AI commented Apr 3, 2026
add_count is declared but never used (and the increment below is also unused). This will at least trigger an unused-variable warning and may fail builds that treat warnings as errors. Please remove it or use it for an assertion/logging if intentional.

inabao (Collaborator, Author) replied

Fixed in a follow-up commit by removing the unused variable and increment.

Comment on lines +164 to +168
```cpp
RequireDistancesNearZero(const vsag::DatasetPtr& result, const std::set<int64_t>& expected_ids) {
    for (int64_t i = 0; i < result->GetDim(); ++i) {
        if (expected_ids.count(result->GetIds()[i]) != 0) {
            REQUIRE(std::abs(result->GetDistances()[i]) <= 1e-6F);
        }
```
Copilot AI commented Apr 3, 2026

This test uses std::abs(...) but the file does not include <cmath>, relying on transitive includes. Add <cmath> (or otherwise ensure the correct overload is visible) to avoid fragile builds / potential compilation failures.

inabao (Collaborator, Author) replied

Added an explicit <cmath> include in tests/test_pyramid.cpp so the std::abs call no longer relies on transitive includes.

Comment on lines +129 to +156
```cpp
TEST_CASE("HGraph Parameters CheckCompatibility", "[ut][HGraphParameter][CheckCompatibility]"){
SECTION("wrong parameter type"){HGraphDefaultParam default_param;
auto param_str = generate_hgraph_param(default_param);
auto param = std::make_shared<vsag::HGraphParameter>();
param->FromString(param_str);
REQUIRE(param->CheckCompatibility(param));
REQUIRE_FALSE(param->CheckCompatibility(std::make_shared<vsag::EmptyParameter>()));
}

TEST_COMPATIBILITY_CASE(
"different base codes io type", base_codes_io_type, "memory_io", "block_memory_io", true)
TEST_COMPATIBILITY_CASE("different pq dim", base_codes_pq_dim, 8, 16, false)
TEST_COMPATIBILITY_CASE(
"different base codes quantization type", base_codes_quantization_type, "sq4", "sq8", false)
TEST_COMPATIBILITY_CASE("different graph type", graph_storage_type, "flat", "compressed", false)
TEST_COMPATIBILITY_CASE("different max degree", max_degree, 26, 30, false)
TEST_COMPATIBILITY_CASE("different support remove", support_remove, true, false, false)
TEST_COMPATIBILITY_CASE("different remove flag bit", remove_flag_bit, 8, 16, false)
TEST_COMPATIBILITY_CASE("different use reorder", use_reorder, true, false, false)
TEST_COMPATIBILITY_CASE(
"different precise codes io type", precise_codes_io_type, "memory_io", "block_memory_io", true)
TEST_COMPATIBILITY_CASE("different precise codes quantization type",
precise_codes_quantization_type,
"fp32",
"sq8",
false)
TEST_COMPATIBILITY_CASE("different use attribute filter", use_attribute_filter, true, false, false)
TEST_COMPATIBILITY_CASE("different support duplicate", support_duplicate, true, false, false)
```
Copilot AI commented Apr 3, 2026

This section appears to have lost formatting (braces/newlines/indentation) and now violates the repository's clang-format / style expectations, making the test hard to read and likely to fail formatting checks. Please reformat this block (e.g., run make fmt).

Suggested change (the properly formatted version of the block quoted above):
```cpp
TEST_CASE("HGraph Parameters CheckCompatibility", "[ut][HGraphParameter][CheckCompatibility]") {
    SECTION("wrong parameter type") {
        HGraphDefaultParam default_param;
        auto param_str = generate_hgraph_param(default_param);
        auto param = std::make_shared<vsag::HGraphParameter>();
        param->FromString(param_str);
        REQUIRE(param->CheckCompatibility(param));
        REQUIRE_FALSE(param->CheckCompatibility(std::make_shared<vsag::EmptyParameter>()));
    }
    TEST_COMPATIBILITY_CASE("different base codes io type",
                            base_codes_io_type,
                            "memory_io",
                            "block_memory_io",
                            true)
    TEST_COMPATIBILITY_CASE("different pq dim", base_codes_pq_dim, 8, 16, false)
    TEST_COMPATIBILITY_CASE("different base codes quantization type",
                            base_codes_quantization_type,
                            "sq4",
                            "sq8",
                            false)
    TEST_COMPATIBILITY_CASE("different graph type", graph_storage_type, "flat", "compressed", false)
    TEST_COMPATIBILITY_CASE("different max degree", max_degree, 26, 30, false)
    TEST_COMPATIBILITY_CASE("different support remove", support_remove, true, false, false)
    TEST_COMPATIBILITY_CASE("different remove flag bit", remove_flag_bit, 8, 16, false)
    TEST_COMPATIBILITY_CASE("different use reorder", use_reorder, true, false, false)
    TEST_COMPATIBILITY_CASE("different precise codes io type",
                            precise_codes_io_type,
                            "memory_io",
                            "block_memory_io",
                            true)
    TEST_COMPATIBILITY_CASE("different precise codes quantization type",
                            precise_codes_quantization_type,
                            "fp32",
                            "sq8",
                            false)
    TEST_COMPATIBILITY_CASE(
        "different use attribute filter", use_attribute_filter, true, false, false)
    TEST_COMPATIBILITY_CASE("different support duplicate", support_duplicate, true, false, false)
```

inabao (Collaborator, Author) replied

Reformatted the compatibility test block and re-ran make fmt in a follow-up commit.

```cpp
DeserializeFromLegacyFormat(StreamReader& reader, size_t total_size) override;

void
Resize(InnerIdType new_size) override {
```
Copilot AI commented Apr 3, 2026

Resize(InnerIdType new_size) has an unused parameter, which can trigger -Wunused-parameter warnings. Consider omitting the parameter name in the definition or explicitly casting to void (as done elsewhere in this file) to keep builds warning-free.

Suggested change:

```cpp
// before
Resize(InnerIdType new_size) override {
// after
Resize(InnerIdType) override {
```

inabao (Collaborator, Author) replied

Updated the inline no-op override to omit the unused parameter name.

Copilot AI review requested due to automatic review settings April 3, 2026 09:00
@inabao inabao force-pushed the pyramid-duplicate branch from 323953c to 3cd4c04 Compare April 3, 2026 09:00
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.

```cpp
deleted_ids_(allocator),
compress_duplicate_data_(compress_redundant_data),
duplicate_ids_(0, allocator),
hole_list_(0, allocator) {
```
Copilot AI commented Apr 3, 2026

The compress_redundant_data constructor parameter is now unused (and not explicitly ignored), which can trigger -Wunused-parameter warnings in builds with warnings enabled. Either remove the parameter (and update call sites) or explicitly mark it unused (e.g., (void)compress_redundant_data;) to keep builds warning-clean.

Suggested change:

```cpp
hole_list_(0, allocator) {
    (void)compress_redundant_data;
```

inabao (Collaborator, Author) replied

Fixed in 2e80fd7. The constructor now explicitly ignores compress_redundant_data with (void)compress_redundant_data; to keep warning-enabled builds clean.

Comment on lines +147 to +166
```cpp
void
LabelTable::Deserialize(StreamReader& reader) {
    StreamReader::ReadVector(reader, label_table_);
    if (use_reverse_map_) {
        for (InnerIdType id = 0; id < label_table_.size(); ++id) {
            this->label_remap_[label_table_[id]] = id;
        }
    }

    if (is_legacy_duplicate_format_ && duplicate_tracker_ != nullptr) {
        duplicate_tracker_->DeserializeFromLegacyFormat(reader, label_table_.size());
    }
    is_legacy_duplicate_format_ = false;

    if (support_tombstone_) {
        StreamReader::ReadObj(reader, deleted_ids_);
    }

    this->total_count_.store(label_table_.size());
}
```
Copilot AI commented Apr 3, 2026

LabelTable::Deserialize adds a legacy-duplicate parsing branch (is_legacy_duplicate_format_ + DeserializeFromLegacyFormat), but there is no unit/functional test that exercises this path end-to-end (i.e., a stream containing the legacy duplicate payload being read into the tracker and leaving the reader positioned correctly for subsequent data). Please add a regression test (likely in src/impl/label_table_test.cpp) that constructs a legacy-formatted buffer and validates that the attached duplicate tracker is populated as expected after Deserialize.

inabao (Collaborator, Author) replied

Fixed in 2e80fd7. Added LabelTable deserializes legacy duplicate payload in src/impl/label_table_test.cpp, which builds a legacy-formatted stream, deserializes it through LabelTable, and verifies the attached DenseDuplicateTracker is populated as expected.

@inabao inabao self-assigned this Apr 7, 2026
Copilot AI review requested due to automatic review settings April 8, 2026 05:35
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/algorithm/hgraph.cpp:1463

  • Legacy duplicate format handling looks incomplete: when dup_version == 0 you set label_table_->is_legacy_duplicate_format_ = true, but bottom_graph_->Deserialize(buffer_reader) will still attempt to deserialize a graph-level duplicate tracker if one was created (e.g., via bottom_graph_param->support_duplicate_). Legacy indexes didn’t serialize duplicate tracking in the graph section, so this will shift the reader and corrupt all subsequent deserialization.

Suggested fix: for dup_version == 0, ensure the bottom graph has no duplicate tracker during bottom_graph_->Deserialize, and instead deserialize the legacy duplicate payload into a temporary tracker via LabelTable::Deserialize, then attach that tracker to bottom_graph_ after the graph has been deserialized.

```cpp
int64_t dup_version = 0;
if (metadata->Get("duplicate_format_version").IsNumberInteger()) {
    dup_version = metadata->Get("duplicate_format_version").GetInt();
}
this->label_table_->is_legacy_duplicate_format_ = (dup_version == 0);

this->deserialize_label_info(buffer_reader);

this->basic_flatten_codes_->Deserialize(buffer_reader);
this->bottom_graph_->Deserialize(buffer_reader);
if (this->use_reorder_) {
```

Comment on lines +127 to +129
```cpp
if (this->bottom_graph_param != nullptr) {
    this->bottom_graph_param->support_duplicate_ = this->support_duplicate;
}
```
Copilot AI commented Apr 8, 2026

Propagating the top-level support_duplicate flag into bottom_graph_param->support_duplicate_ here can break deserialization of legacy duplicate indexes. For legacy serialized files, duplicate payload lives in the label-table section (not the graph section), so enabling graph-level duplicate tracking causes GraphInterface::Deserialize() to consume bytes that actually belong to the graph storage and corrupt the stream.

Suggested fix: don’t force bottom_graph_param->support_duplicate_ from the top-level flag in FromJson(). Instead, rely on the explicit {GRAPH_KEY: {support_duplicate: ...}} mapping (added in map_hgraph_param) and/or gate graph-level duplicate support using the footer duplicate_format_version during HGraph::Deserialize so legacy files keep graph duplicate tracking disabled during graph deserialization.

Suggested change (remove these lines):

```cpp
if (this->bottom_graph_param != nullptr) {
    this->bottom_graph_param->support_duplicate_ = this->support_duplicate;
}
```

Comment on lines 94 to +112
```cpp
Serialize(StreamWriter& writer) {
    StreamWriter::WriteObj(writer, this->total_count_);
    StreamWriter::WriteObj(writer, this->max_capacity_);
    StreamWriter::WriteObj(writer, this->maximum_degree_);

    if (duplicate_tracker_) {
        duplicate_tracker_->Serialize(writer);
    }
}

virtual void
Deserialize(StreamReader& reader) {
    StreamReader::ReadObj(reader, this->total_count_);
    StreamReader::ReadObj(reader, this->max_capacity_);
    StreamReader::ReadObj(reader, this->maximum_degree_);

    if (duplicate_tracker_) {
        duplicate_tracker_->Deserialize(reader);
    }
```
Copilot AI commented Apr 8, 2026

GraphInterface::Serialize/Deserialize conditionally writes/reads the duplicate tracker payload based solely on whether duplicate_tracker_ is non-null, but the stream format doesn’t record a presence/version flag. If a caller constructs a graph with a tracker (e.g., because parameters default support_duplicate_ to true) and then deserializes an older graph stream that never wrote tracker bytes, Deserialize() will read the wrong bytes and corrupt the stream.

Suggested fix: encode an explicit boolean/version before the tracker payload (or gate tracker creation/deserialization using a higher-level format version like duplicate_format_version) so the reader can safely skip tracker bytes for legacy streams.

inabao added 2 commits April 9, 2026 12:25
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Align dense and sparse duplicate trackers around canonical group ids so sparse storage can use a single ring map instead of dual indexes. Preserve legacy duplicate payload compatibility via DeserializeFromLegacyFormat and add regression coverage for tracker and graph wrapper behavior.

Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
@inabao inabao force-pushed the pyramid-duplicate branch from 2e80fd7 to 716708c Compare April 9, 2026 05:15
@mergify mergify bot added the module/docs label Apr 9, 2026
Remove local planning/spec documents and restore the Pyramid example so the PR only carries the duplicate-tracker implementation changes.

Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Copilot AI review requested due to automatic review settings April 9, 2026 05:29
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.

Comment on lines +225 to +226
```cpp
auto base = MakeDenseDataset(
    {shared_vector, shared_vector, other_vector}, {101, 102, 103}, {"a/d/f", "a/d/f", "b/e/g"});
```
Copilot AI commented Apr 9, 2026

Several newly added lines exceed the repo's clang-format ColumnLimit: 100 (e.g., this MakeDenseDataset(...) invocation). Please run make fmt or manually wrap the arguments onto multiple lines so the file stays format-clean and CI formatting checks don't fail.

```cpp
std::copy(vector.begin(), vector.end(), raw_vector);
raw_path[0] = path;

dataset->NumElements(1)->Dim(4)->Float32Vectors(raw_vector)->Paths(raw_path)->Owner(true);
```
Copilot AI commented Apr 9, 2026

This fluent call chain is longer than clang-format's configured ColumnLimit and will likely be wrapped by make fmt. Consider splitting the chained setters across multiple lines to keep the test file consistently formatted.

Comment on lines 768 to +773
```cpp
InnerSearchParam search_param;
search_param.ef = ef_construction_;
search_param.topk = static_cast<int64_t>(ef_construction_);
search_param.search_mode = KNN_SEARCH;
if (label_table_->CompressDuplicateData()) {
search_param.hops_limit = 10000;  // Add hops limit to prevent infinite loop
if (support_duplicate_) {
```
Copilot AI commented Apr 9, 2026

search_param.hops_limit is set to the magic constant 10000. To make the behavior easier to tune and reason about, consider extracting this into a named constant (or wiring it from existing search/index parameters) so it's clear why this value is appropriate and can be adjusted without editing code.

LHT129 (Collaborator) left a comment

LGTM

@inabao inabao merged commit 8999545 into main Apr 9, 2026
20 checks passed
@inabao inabao deleted the pyramid-duplicate branch April 9, 2026 09:05
