-
Notifications
You must be signed in to change notification settings - Fork 116
Practical Examples Using Cpp
Practical examples with a large-scale dataset for two types of primary NGT graphs (ANNG, ONNG) are described.
First, to describe how to search large-scale datasets, an NGT dataset needs to be generated. Download the fastText dataset as follows.
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip
The dataset above should be converted to a format that our sample scripts can read by using the following python script.
# dataset.py
with open('wiki-news-300d-1M-subword.vec', 'r') as fi,\
open('objects.tsv', 'w') as fov, open('words.tsv', 'w') as fow:
n, dim = map(int, fi.readline().split())
fov.write('{0}\t{1}\n'.format(n, dim))
for line in fi:
tokens = line.rstrip().split(' ')
fow.write(tokens[0] + '\n')
fov.write('{0}\n'.format('\t'.join(tokens[1:])))
Below is an example of how to construct an ANNG with cosine similarity for metric space.
// construct-fasttext.cpp
#include "NGT/Index.h"
using namespace std;
int main(int argc, char **argv)
{
ifstream is("objects.tsv");
string line;
getline(is, line);
stringstream linestream(line);
NGT::Property property;
size_t n;
linestream >> n;
linestream >> property.dimension;
property.objectType = NGT::ObjectSpace::ObjectType::Float;
property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeCosine;
string path("fasttext.anng");
NGT::Index::create(path, property);
NGT::Index index(path);
index.disableLog();
while (getline(is, line)) {
stringstream linestream(line);
vector<float> object;
while (!linestream.eof()) {
float value;
linestream >> value;
object.push_back(value);
}
index.append(object);
}
index.createIndex(16); // 16 is the number of threads to construct.
index.save();
return 0;
}
The ANNG can be searched with a query by using the following program.
// search-fasttext.cpp
#include "NGT/Index.h"
#include <sstream>
#include <iomanip>
using namespace std;
int main(int argc, char **argv)
{
ifstream is("words.tsv");
string word;
vector<string> words;
while (getline(is, word)) {
words.push_back(word);
}
NGT::Index index("fasttext.anng", true);
int queryID = 10001;
vector<float> query;
index.getObjectSpace().getObject(queryID, query);
NGT::SearchQuery searchQuery(query);
NGT::ObjectDistances results;
searchQuery.setResults(&results);
searchQuery.setSize(10);
searchQuery.setEpsilon(0.1);
index.search(searchQuery);
cout << "Query No." << queryID << endl;
cout << "Rank\tID\tDistance" << setprecision(6) << fixed << endl;
for (size_t i = 0; i < results.size(); i++) {
cout << i + 1 << "\t" << results[i].id << "\t" << results[i].distance << "\t" << words[results[i].id - 1] << endl;
}
cout << endl;
return 0;
}
Below are the search results.
Query No.10001
Rank ID Distance
1 10001 0.000000 Doctors
2 4632 0.244096 Doctor
3 79543 0.258944 Medics
4 2045 0.263412 doctors
5 339398 0.274972 Doctoring
6 20668 0.280508 Physicians
7 80647 0.292580 Dentists
8 24256 0.292752 Nurses
9 9481 0.322196 Scientists
10 623161 0.330500 Practioners
When a higher accuracy is needed, you can specify an epsilon value of SearchQuery higher than the default 0.1 as shown below.
searchQuery.setEpsilon(0.15);
When a short query time is needed at the expense of accuracy, you can specify a smaller epsilon value.
Command line tool
Python
C++