Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to create QG. #176

Open
fgheng opened this issue Dec 18, 2024 · 5 comments
Open

How to create QG. #176

fgheng opened this issue Dec 18, 2024 · 5 comments

Comments

@fgheng
Copy link

fgheng commented Dec 18, 2024

Hello, I am using NGT-qg for vector recall, and I found that there are two ways to convert ONNG to QG:

Command line: qbg command [option] index [data]
C++ code: NGTQG::Index::quantize(indexPath, dimensionOfSubvector, maxNumberOfEdges, true); which from samples/qg-l2-float

However, I noticed that when using the qbg command line to perform QG on ONNG, it involves two steps:

  1. qbg create-qg
  2. qbg build-qg

When I checked the C++ source code corresponding to these two commands, I found that they perform the following operations:

// lib/NGT/NGTQ/QbgCli.cpp::createQG
void QBG::CLI::createQG(NGT::Args &args) {
  const std::string usage = "Usage: qbg create-qg [-Q dimension-of-subvector] index";

  QbgCliBuildParameters buildParameters(args);
  buildParameters.getCreationParameters();

  string indexPath;
  try {
    indexPath = args.get("#1");
  } catch (...) {
    std::stringstream msg;
    msg << "No index is specified." << std::endl;
    msg << usage << endl;
    NGTThrowException(msg);
  }
  std::cerr << "creating..." << std::endl;
  NGTQG::Index::create(indexPath, buildParameters);
  std::cerr << "appending..." << std::endl;
  NGTQG::Index::append(indexPath, buildParameters);
}
// lib/NGT/NGTQ/QbgCli.cpp::buildQG
void QBG::CLI::buildQG(NGT::Args &args) {
  const std::string usage = "Usage: qbg build-qg [-Q dimension-of-subvector] [-E max-number-of-edges] index";

  QbgCliBuildParameters buildParameters(args);
  buildParameters.getBuildParameters();

  string indexPath;
  try {
    indexPath = args.get("#1");
  } catch (...) {
    std::stringstream msg;
    msg << "No index is specified." << std::endl;
    msg << usage << std::endl;
    NGTThrowException(msg);
  }

  size_t phase         = args.getl("p", 0);
  size_t maxNumOfEdges = args.getl("E", 128);

  const std::string qgPath = indexPath + "/qg";

  if (phase == 0 || phase == 1) {
    QBG::Optimizer optimizer(buildParameters);
    optimizer.globalType = QBG::Optimizer::GlobalTypeZero;

#ifdef NGTQG_NO_ROTATION
    if (optimizer.rotation || optimizer.repositioning) {
      std::cerr << "build-qg: Warning! Although rotation or repositioning is specified, turn off rotation "
                   "and repositioning because of unavailable options."
                << std::endl;
      optimizer.rotation      = false;
      optimizer.repositioning = false;
    }
#endif

    std::cerr << "optimizing..." << std::endl;
    optimizer.optimize(qgPath);
  }
  auto verbose = buildParameters.optimization.verbose;
  if (phase == 0 || phase == 2) {
    std::cerr << "building the inverted index..." << std::endl;
    QBG::Index::buildNGTQ(qgPath, verbose);
  }
  if (phase == 0 || phase == 3) {
    std::cerr << "building the quantized graph... " << std::endl;
    NGTQG::Index::realign(indexPath, maxNumOfEdges, verbose);
  }
}

It seems that when using the command line to perform QG on ONNG, the operations involved are more extensive and mainly include:

  // createQG
  NGTQG::Index::create(indexPath, buildParameters);
  NGTQG::Index::append(indexPath, buildParameters);

  // buildQG
  NGTQG::Index::realign(indexPath, maxNumOfEdges, verbose);

This differs from the method implemented in samples/qg-l2-float. What is the difference here?

I would appreciate any help or insights you can provide on this issue.

@masajiro
Copy link
Member

masajiro commented Dec 18, 2024

Hello,
These two are basically the same in functionality. First, I created the command-line tool. Later, when I added API functionality and the sample code, I consolidated the functionality of the command-line tool into the function quantize() for better usability. Inside the function quantize(), it calls the same functions as the command-line tool. The command-line tool allows for more detailed configurations.

@fgheng
Copy link
Author

fgheng commented Dec 19, 2024

Thank you for your response. I still have a few questions I need your help with.

I wrote a piece of code to test the speed of QG as follows:

  1. I created an ANNG graph using the L2 distance
  2. converted the ANNG graph to an ONNG
  3. applying the quantizer method to perform QG on the ONNG
  4. I retrieved the top 100 results for 1000 queries
#include <iostream>
#include <chrono>
#include "NGT/Index.h"
#include "NGT/GraphOptimizer.h"
#include "NGT/NGTQ/QuantizedGraph.h"

int load_data_from_file(const std::string& file_path, std::vector<uint64_t>& labels, std::vector<uint8_t>& vecs, const size_t vector_size, const size_t count) {
    std::string line;

    std::cout << "count is " << count << std::endl;
    std::cout << "load data from file: " << file_path << std::endl;
    std::ifstream data(file_path);
    size_t c = 0;
    size_t invalid_count = 0;
    if (data.is_open()) {
        while (std::getline(data, line) && c < count) {
            std::vector<std::string> fields;
            size_t pos = 0;
            std::string delimiter = "\t";
            while ((pos = line.find(delimiter)) != std::string::npos) {
                fields.push_back(line.substr(0, pos));
                line.erase(0, pos + delimiter.length());
            }
            fields.push_back(line);

            if (fields.size() != 2) {
                invalid_count++;
                continue;
            }

            uint64_t label = std::stoull(fields[0]);
            std::string emb_str = fields[1];

            std::stringstream ss(emb_str);
            std::vector<uint8_t> vec(vector_size);
            std::string token;
            float* f_vec = (float*)vec.data();
            while(getline(ss, token, ',') && f_vec < (float*)(vec.data() + vector_size)) {
                // vec.push_back(std::stof(token));
                *f_vec = std::stof(token);
                f_vec++;
            }

            labels.push_back(label);

            vecs.insert(vecs.end(), vec.begin(), vec.end());
            c++;
        }

    } else {
        std::cout << "Unable to open file" << std::endl;
        return 1;
    }

    std::cout << "Invalid count is " << invalid_count << std::endl;
    std::cout << "Load data size " << c << std::endl;
    data.close();
    return 0;
}

int main(int argc, char **argv)
{
    // craete anng index
    NGT::Property property;
    property.dimension = 128;
    property.objectType           = NGT::ObjectSpace::ObjectType::Float;
    property.distanceType         = NGT::Index::Property::DistanceType::L2;
    property.edgeSizeForCreation  = 200;

    std::string anng_path("cpp_anng-100");
    NGT::Index::create(anng_path, property);
    NGT::Index    index(anng_path);

    // loading data from file
    // id1\tf1,f2,f3,f4,...f128
    // id2\tf1,f2,f3,f4,...f128
    // ...
    std::string data_path = "../../data/deal/float_data_600k_deal.csv";
    std::vector<uint8_t> object;
    std::vector<uint64_t> labels;
    size_t vec_size = 128*sizeof(float);
    size_t vec_num = 600000;
    load_data_from_file(data_path, labels, object, vec_size, vec_num);

    for (size_t i = 0; i < vec_num; i++) {
        float* d = (float*)(object.data() + i*vec_size);
        std::vector<float> vec;
        for (int j = 0; j < 128; j++) {
            vec.push_back(*(d+j));
        }
        index.append(vec);
    }

    index.createIndex(16);
    index.save();

    // // optimizer search paramters
    // NGT::GraphOptimizer graphOptimizer1(false);
    // graphOptimizer1.setProcessingModes(false, true, true, true);
    // graphOptimizer1.optimizeSearchParameters(anng_path);

    // // refine
    // std::string ranng_path = "cpp_ranng-100";
    // NGT::Index    index_refine(anng_path);
    // NGT::GraphReconstructor::refineANNG(index_refine, false);
    // index_refine.save(ranng_path);

    // trans anng to onng
    std::string onng_path = "cpp_onng-100";
    NGT::GraphOptimizer   graphOptimizer(false);
    int numOfOutgoingEdges = 10;
    int numOfIncomingEdges = 200;
    int numOfQueries = 200;
    int numOfResultantObjects = 100; // k
    graphOptimizer.set(numOfOutgoingEdges, numOfIncomingEdges, numOfQueries, numOfResultantObjects);
    graphOptimizer.execute(anng_path, onng_path);

    // PG
    size_t dimensionOfSubvector = 1;
    size_t maxNumberOfEdges = 50;

    try {
        std::cout << "quantizing index ..." << std::endl;
        NGTQG::Index::quantize(onng_path, dimensionOfSubvector, maxNumberOfEdges, true);
    } catch (NGT::Exception &err) {
        std::cout << "error: " << err.what() << std::endl;
        return 1;
    } catch (...) {
        std::cout << "error" << std::endl;
        return 1;
    }

    // search
    NGTQG::Index    index_search(onng_path, true);
    NGT::Property property2;
    index_search.getProperty(property2);

    // load query data
    std::string query_path = "../../data/gt/float_query_emb_deal_part0.csv";
    std::vector<uint8_t> object2;
    std::vector<uint64_t> labels2;
    size_t query_vec_num = 1000;
    load_data_from_file(query_path, labels2, object2, vec_size, query_vec_num);

    std::vector<std::vector<float>> querys;
    for (size_t i = 0; i < query_vec_num; i++) {
        std::vector<float> query;
        for (size_t j = 0; j < vec_size/sizeof(float); j++) {
            float value = *((float*)(object2.data()+i*vec_size)+j);
            query.push_back(value);
        }
        querys.push_back(query);
    }
    std::cout << "laod data success." << std::endl;

    auto start = std::chrono::high_resolution_clock::now();
    auto end = std::chrono::high_resolution_clock::now();
    uint64_t total = 0;

    std::cout << "ready to search" << std::endl;
    float dis = 0.0;
    float sum = 0.0;
    size_t repeat = 1;
    for (size_t r = 0; r < repeat; r++) {
        for (size_t i = 0; i < 1000; i++) {
            NGTQG::SearchQuery searchQuery(querys[i]);
            NGT::ObjectDistances results;
            searchQuery.setResults(&results);
            searchQuery.setSize(100);
            // searchQuery.setEpsilon(0.1);
            searchQuery.setExpectedAccuracy(0.9);
            start = std::chrono::high_resolution_clock::now();
            index_search.search(searchQuery);
            end = std::chrono::high_resolution_clock::now();
            total += std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
            for (size_t i = 0; i < results.size(); i++) {
                std::cout << i + 1 << "\t" << labels[results[i].id] << "\t" << results[i].distance << std::endl;
            }
        }
    }

    std::cout << "consume:\t" << total << std::endl;

    return 0;
}

I have encountered a few issues:

  1. The speed of converting from ANNG to ONNG and from ONNG to QG is very slow, taking 2-3 hours for 600,000 data points. Is this normal?
  2. The search speed is also very slow. It takes about 3 seconds to complete the search for 1000 queries, whereas using HNSW only takes around 0.2 seconds to finish the search.
  3. When I use the IP distance to build the ANNG, I always get the warning: Warning! magnitude is larger than the current max magnitude. 437 Warning! magnitude is larger than the current max magnitude, Is this normal?.
  4. Does QG currently support ANNG graphs constructed with the IP distance?

I saw in the ann-benchmark tests that NGT-QG demonstrated very strong performance, but my results seem different. Is there any mistake in my code, or are there any parameters I can adjust?

I hope you can provide some suggestions, thank you very much.

@fgheng
Copy link
Author

fgheng commented Dec 19, 2024

Thank you for your response. I still have a few questions I need your help with.

I wrote a piece of code to test the speed of QG as follows:

  1. I created an ANNG graph using the L2 distance
  2. converted the ANNG graph to an ONNG
  3. applying the quantizer method to perform QG on the ONNG
  4. I retrieved the top 100 results for 1000 queries
#include <iostream>
#include <chrono>
#include "NGT/Index.h"
#include "NGT/GraphOptimizer.h"
#include "NGT/NGTQ/QuantizedGraph.h"

int load_data_from_file(const std::string& file_path, std::vector<uint64_t>& labels, std::vector<uint8_t>& vecs, const size_t vector_size, const size_t count) {
    std::string line;

    std::cout << "count is " << count << std::endl;
    std::cout << "load data from file: " << file_path << std::endl;
    std::ifstream data(file_path);
    size_t c = 0;
    size_t invalid_count = 0;
    if (data.is_open()) {
        while (std::getline(data, line) && c < count) {
            std::vector<std::string> fields;
            size_t pos = 0;
            std::string delimiter = "\t";
            while ((pos = line.find(delimiter)) != std::string::npos) {
                fields.push_back(line.substr(0, pos));
                line.erase(0, pos + delimiter.length());
            }
            fields.push_back(line);

            if (fields.size() != 2) {
                invalid_count++;
                continue;
            }

            uint64_t label = std::stoull(fields[0]);
            std::string emb_str = fields[1];

            std::stringstream ss(emb_str);
            std::vector<uint8_t> vec(vector_size);
            std::string token;
            float* f_vec = (float*)vec.data();
            while(getline(ss, token, ',') && f_vec < (float*)(vec.data() + vector_size)) {
                // vec.push_back(std::stof(token));
                *f_vec = std::stof(token);
                f_vec++;
            }

            labels.push_back(label);

            vecs.insert(vecs.end(), vec.begin(), vec.end());
            c++;
        }

    } else {
        std::cout << "Unable to open file" << std::endl;
        return 1;
    }

    std::cout << "Invalid count is " << invalid_count << std::endl;
    std::cout << "Load data size " << c << std::endl;
    data.close();
    return 0;
}

int main(int argc, char **argv)
{
    // craete anng index
    NGT::Property property;
    property.dimension = 128;
    property.objectType           = NGT::ObjectSpace::ObjectType::Float;
    property.distanceType         = NGT::Index::Property::DistanceType::L2;
    property.edgeSizeForCreation  = 200;

    std::string anng_path("cpp_anng-100");
    NGT::Index::create(anng_path, property);
    NGT::Index    index(anng_path);

    // loading data from file
    // id1\tf1,f2,f3,f4,...f128
    // id2\tf1,f2,f3,f4,...f128
    // ...
    std::string data_path = "../../data/deal/float_data_600k_deal.csv";
    std::vector<uint8_t> object;
    std::vector<uint64_t> labels;
    size_t vec_size = 128*sizeof(float);
    size_t vec_num = 600000;
    load_data_from_file(data_path, labels, object, vec_size, vec_num);

    for (size_t i = 0; i < vec_num; i++) {
        float* d = (float*)(object.data() + i*vec_size);
        std::vector<float> vec;
        for (int j = 0; j < 128; j++) {
            vec.push_back(*(d+j));
        }
        index.append(vec);
    }

    index.createIndex(16);
    index.save();

    // // optimizer search paramters
    // NGT::GraphOptimizer graphOptimizer1(false);
    // graphOptimizer1.setProcessingModes(false, true, true, true);
    // graphOptimizer1.optimizeSearchParameters(anng_path);

    // // refine
    // std::string ranng_path = "cpp_ranng-100";
    // NGT::Index    index_refine(anng_path);
    // NGT::GraphReconstructor::refineANNG(index_refine, false);
    // index_refine.save(ranng_path);

    // trans anng to onng
    std::string onng_path = "cpp_onng-100";
    NGT::GraphOptimizer   graphOptimizer(false);
    int numOfOutgoingEdges = 10;
    int numOfIncomingEdges = 200;
    int numOfQueries = 200;
    int numOfResultantObjects = 100; // k
    graphOptimizer.set(numOfOutgoingEdges, numOfIncomingEdges, numOfQueries, numOfResultantObjects);
    graphOptimizer.execute(anng_path, onng_path);

    // PG
    size_t dimensionOfSubvector = 1;
    size_t maxNumberOfEdges = 50;

    try {
        std::cout << "quantizing index ..." << std::endl;
        NGTQG::Index::quantize(onng_path, dimensionOfSubvector, maxNumberOfEdges, true);
    } catch (NGT::Exception &err) {
        std::cout << "error: " << err.what() << std::endl;
        return 1;
    } catch (...) {
        std::cout << "error" << std::endl;
        return 1;
    }

    // search
    NGTQG::Index    index_search(onng_path, true);
    NGT::Property property2;
    index_search.getProperty(property2);

    // load query data
    std::string query_path = "../../data/gt/float_query_emb_deal_part0.csv";
    std::vector<uint8_t> object2;
    std::vector<uint64_t> labels2;
    size_t query_vec_num = 1000;
    load_data_from_file(query_path, labels2, object2, vec_size, query_vec_num);

    std::vector<std::vector<float>> querys;
    for (size_t i = 0; i < query_vec_num; i++) {
        std::vector<float> query;
        for (size_t j = 0; j < vec_size/sizeof(float); j++) {
            float value = *((float*)(object2.data()+i*vec_size)+j);
            query.push_back(value);
        }
        querys.push_back(query);
    }
    std::cout << "laod data success." << std::endl;

    auto start = std::chrono::high_resolution_clock::now();
    auto end = std::chrono::high_resolution_clock::now();
    uint64_t total = 0;

    std::cout << "ready to search" << std::endl;
    float dis = 0.0;
    float sum = 0.0;
    size_t repeat = 1;
    for (size_t r = 0; r < repeat; r++) {
        for (size_t i = 0; i < 1000; i++) {
            NGTQG::SearchQuery searchQuery(querys[i]);
            NGT::ObjectDistances results;
            searchQuery.setResults(&results);
            searchQuery.setSize(100);
            // searchQuery.setEpsilon(0.1);
            searchQuery.setExpectedAccuracy(0.9);
            start = std::chrono::high_resolution_clock::now();
            index_search.search(searchQuery);
            end = std::chrono::high_resolution_clock::now();
            total += std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
            for (size_t i = 0; i < results.size(); i++) {
                std::cout << i + 1 << "\t" << labels[results[i].id] << "\t" << results[i].distance << std::endl;
            }
        }
    }

    std::cout << "consume:\t" << total << std::endl;

    return 0;
}

I have encountered a few issues:

  1. The speed of converting from ANNG to ONNG and from ONNG to QG is very slow, taking 2-3 hours for 600,000 data points. Is this normal?
  2. The search speed is also very slow. It takes about 3 seconds to complete the search for 1000 queries, whereas using HNSW only takes around 0.2 seconds to finish the search.
  3. When I use the IP distance to build the ANNG, I always get the warning: Warning! magnitude is larger than the current max magnitude. 437 Warning! magnitude is larger than the current max magnitude.
  4. Does QG currently support ANNG graphs constructed with the IP distance?

I saw in the ann-benchmark tests that NGT-QG demonstrated very strong performance, but my results seem different. Is there any mistake in my code, or are there any parameters I can adjust?

I hope you can provide some suggestions, thank you very much.

I compile this code using

g++ -o ./ngt_qg_test ./ngt_qg_test.cc -lngt -lblas -march=native -O3 -fopenmp

and I compile ngt using

yum install blas-devel lapack-devel
unzip NGT-x.x.x.zip
cd NGT-x.x.x
mkdir build
cd build
cmake ..
make
make install
ldconfig /usr/local/lib

@masajiro
Copy link
Member

As for your questions 1 to 3, the issues might be caused by your dataset. First, I recommend trying the same operation using the NGT command-line tools (ngt and qbg). If you encounter the same situation, try adjusting the parameters.
As for question 4, QG does not support inner product yet.

@fgheng
Copy link
Author

fgheng commented Dec 20, 2024

As for your questions 1 to 3, the issues might be caused by your dataset. First, I recommend trying the same operation using the NGT command-line tools (ngt and qbg). If you encounter the same situation, try adjusting the parameters. As for question 4, QG does not support inner product yet.

thank you very much, I will retry using command-line.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants