Skip to content

Indexing proposal follow upΒ #222

@ShaMan123

Description

@ShaMan123

Description πŸ“

Follow up on indexing proposal #164 after experimentation.

related #214 #215

Performance

Using Map seems to be causing a significant performance impact.
I think it is best if we create a specialized interface on top of the raw index IndexReader that avoids memory allocation as much as possible.

Combining Data Types

There is no way to combine string and number values causing redundant index/keys declarations. Consider the following:

   let map: Map<string, { type: number, weight: number, count: number, tag: string}>


      {
        type: EditRequestType.CREATE_INDEX,
        data: {
          name: "data",
          keys: [...map.values()],
          values: [...map.values()].flatMap(({ type, weight, count, tag }) => [
            type,
            Math.ceil((weight.value ?? 0) * WEIGHT_ROUNDING_FACTOR),
            count,
            // tag // can't be combined with numbers
          ]),
          end: Array.from({ length: map.size }).map(
            (v, index) => (index + 1) * 3
          ),
        },
      },
      {
        type: EditRequestType.CREATE_INDEX,
        data: {
          name: "tags",
          keys: [...map.values()],
          values: [...map.values()].map(({ tag }) => tag),
        },
      },

Using Transferable Data Types

Should we expose a method getIndex(name) that retrieves the entire index?
Should we build a transferable ArrayBuffer and expose an interface on top of it for reading IndexReader so that it can be messaged back to the main thread without being cloned?

**code**

worker

getIndex(name: string): { data: { keys: ArrayBuffer; values: ArrayBuffer | null; keysLength: number; valuesLength: number; keyType: "string" | "number"; valueType: "string" | "number" | "none"; mode: IndexMode; }; transfers: ArrayBuffer[]; } | null {
  const entry = this.resolve(name);
  if (!entry) return null;

  const { source, info } = entry;
  const { mode, keyType, valueType, size } = info;

  // ── Keys ────────────────────────────────────────────────────────────────
  // Encode string keys as a UTF-8 byte buffer (null-terminated per entry)
  // so the ArrayBuffer is transferable. Number keys are a plain Uint32Array.

  let keysBuffer: ArrayBuffer;
  let keysLength = size;

  if (keyType === "number") {
    const arr = new Uint32Array(size);
    for (let i = 0; i < size; i++) arr[i] = this.readNumberKey(source, i) ?? 0;
    keysBuffer = arr.buffer;
  } else {
    const encoder = new TextEncoder();
    const parts: Uint8Array[] = [];
    for (let i = 0; i < size; i++) {
      parts.push(encoder.encode((this.readStringKey(source, i) ?? "") + "\0"));
    }
    const total = parts.reduce((n, p) => n + p.byteLength, 0);
    const buf = new Uint8Array(total);
    let offset = 0;
    for (const p of parts) { buf.set(p, offset); offset += p.byteLength; }
    keysBuffer = buf.buffer;
  }

  // ── Values ──────────────────────────────────────────────────────────────

  let valuesBuffer: ArrayBuffer | null = null;
  let valuesLength = 0;

  if (mode === "oneToOne") {
    if (valueType === "number") {
      const arr = new Uint32Array(size);
      for (let i = 0; i < size; i++) arr[i] = this.readNumberValueAt(source, i) ?? 0;
      valuesBuffer = arr.buffer;
      valuesLength = size;
    } else if (valueType === "string") {
      const encoder = new TextEncoder();
      const parts: Uint8Array[] = [];
      for (let i = 0; i < size; i++) {
        parts.push(encoder.encode((this.readStringValueAt(source, i) ?? "") + "\0"));
      }
      const total = parts.reduce((n, p) => n + p.byteLength, 0);
      const buf = new Uint8Array(total);
      let offset = 0;
      for (const p of parts) { buf.set(p, offset); offset += p.byteLength; }
      valuesBuffer = buf.buffer;
      valuesLength = size;
    }
  } else if (mode === "oneToNLinear" || mode === "oneToNNonLinear") {
    if (valueType === "number") {
      // Copy the entire flat value pool β€” sliceBounds offsets index into it.
      const src = this.numberValuesArray(source);
      const copy = src ? new Uint32Array(src) : new Uint32Array(0);
      valuesBuffer = copy.buffer;
      valuesLength = copy.length;
    } else if (valueType === "string") {
      const encoder = new TextEncoder();
      const [, lastEnd] = this.sliceBounds(entry, size - 1);
      const parts: Uint8Array[] = [];
      for (let i = 0; i < lastEnd; i++) {
        parts.push(encoder.encode((this.readStringValueAt(source, i) ?? "") + "\0"));
      }
      const total = parts.reduce((n, p) => n + p.byteLength, 0);
      const buf = new Uint8Array(total);
      let offset = 0;
      for (const p of parts) { buf.set(p, offset); offset += p.byteLength; }
      valuesBuffer = buf.buffer;
      valuesLength = parts.length;
    }
  }

  // ── Transfer list ────────────────────────────────────────────────────────

  const transfers: ArrayBuffer[] = [keysBuffer];
  if (valuesBuffer) transfers.push(valuesBuffer);

  return {
    data: { keys: keysBuffer, values: valuesBuffer, keysLength, valuesLength, keyType, valueType, mode },
    transfers,
  };
}

main thread

function decodeIndex(data: ReturnType<VirtualIndexesController["getIndex"]>["data"]) {
  const { keys, values, keysLength, valuesLength, keyType, valueType, mode } = data;
  const decoder = new TextDecoder();

  // Decode keys
  const decodedKeys: (string | number)[] = keyType === "number"
    ? Array.from(new Uint32Array(keys))
    : decoder.decode(keys).split("\0").slice(0, keysLength);

  // Decode values
  const map = new Map<string | number, IndexEntry>();
  if (mode === "keysOnly" || !values) {
    for (const key of decodedKeys) map.set(key, null);
    return map;
  }

  const decodedValues: (string | number)[] = valueType === "number"
    ? Array.from(new Uint32Array(values))
    : decoder.decode(values).split("\0").slice(0, valuesLength);

  for (let i = 0; i < decodedKeys.length; i++) {
    map.set(decodedKeys[i], decodedValues[i] ?? null);
  }
  return map;
}

Index Validation

related #221

I have reached OOM and caused the browser to crash without any error messages due to an incorrect end vector that had values that exceeded the size of values.
Since the error is not transparent, and the construction of start/end is not straight forward, I would like to suggest validating start/end against the length of values, as well against as each other, throwing if out of bounds.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions