Visualizing the knowledge of large language models
Recent developments in AI in 2022 and 2023 have led to a large number of newly released tools based on large language models, such as ChatGPT, Bing Chat or Bard. While these models are impressive, it has become increasingly clear that they don't know everything. This raises questions about the limits of AI's knowledge, the nature of latent spaces, and their accessibility.
The aim of this project is to explore new ways of visualising and exploring latent spaces and making them more accessible to the public. It does this through a web interface that gives users an interactive platform to explore and search high-dimensional vector embeddings, visualised on a two-dimensional <canvas>. Each high-dimensional vector embedding is represented as a cross on the <canvas>, and each search generates a new cross, drawing connections between similar pieces of data in the embedding space. The resulting network of interconnected information creates visually pleasing network structures, in line with recent studies suggesting that knowledge is best represented as rich, interconnected networks rather than linear trees.
See ✨ are.na for a collection of the sources my research is based on.
- Fonts in use:
  - Times New Roman by Stanley Morison and Victor Lardent
  - Helvetica by Max Miedinger
- Technology:
  - Frontend:
    - Built with SvelteKit
    - Canvas powered by Konva
    - Modals are created with svelte-modals
  - Backend:
    - Built with FastAPI
    - Datasets are fetched through the 🤗 Datasets library
    - Creating embeddings and searching is made possible thanks to Sentence Transformers
In order to make a text dataset searchable and displayable in two dimensions, it has to be processed in two steps. First, it is converted into vector embeddings. As these embeddings are high-dimensional, a second processing step is needed to reduce them to two dimensions. Only then can the vectors be rendered.
The Sentence Transformers library, specifically the multi-qa-MiniLM-L6-cos-v1 model, was used to encode the datasets into vectors. Truncated singular value decomposition (TruncatedSVD) was then used to reduce the embeddings to two dimensions; this transformation is also known as latent semantic analysis (LSA). I chose TruncatedSVD over t-SNE because it is deterministic: given the same high-dimensional vectors, it always produces the same two-dimensional vectors. The scikit-learn implementation of TruncatedSVD was used to apply this transformation.
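A minimal sketch of these two processing steps, assuming a handful of placeholder texts (the actual preprocessing in this repository may differ in detail):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import TruncatedSVD

# Step 1: encode the texts into 384-dimensional vector embeddings.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
texts = ["first document", "second document", "third document"]
embeddings = model.encode(texts)  # shape: (3, 384)

# Step 2: reduce the embeddings to two dimensions for rendering.
# With a fixed random_state, the same input always yields the same output.
svd = TruncatedSVD(n_components=2, random_state=42)
points_2d = svd.fit_transform(embeddings)  # shape: (3, 2)
```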
Both embedding datasets, the high-dimensional and the two-dimensional one, were stored on Hugging Face via the 🤗 Datasets library.
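Storing the two-dimensional dataset could look roughly like this sketch, continuing from the code above (the repository id is hypothetical):

```python
from datasets import Dataset

# Build a dataset with one row per text and its 2D position.
ds = Dataset.from_dict({
    "text": texts,
    "x": points_2d[:, 0].tolist(),
    "y": points_2d[:, 1].tolist(),
})
ds.push_to_hub("username/knowledge-spaces-2d")  # hypothetical repo id
```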
Additional datasets should be added in the future; this issue keeps track of them: #4.
To make the dataset searchable, the semantic search function from the Sentence Transformers library was used. For it to work, the search term has to be encoded with the same model that was used to encode the whole dataset, in this case multi-qa-MiniLM-L6-cos-v1. The encoded search term, the encoded dataset and the number of similar results to retrieve are then passed to the function.
Among other things, the function returns the index of each similar result in the dataset, which is then used to map the results to the two-dimensional representation.
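Put together, a search could look roughly like this (the corpus and query are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus = ["first document", "second document", "third document"]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Encode the query with the same model, then search the encoded dataset.
query_embedding = model.encode("my search term", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)

# hits[0] holds one dict per result, e.g. {"corpus_id": 1, "score": 0.42};
# corpus_id is the index used to map the result to its 2D representation.
for hit in hits[0]:
    print(hit["corpus_id"], hit["score"])
```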
All of this happens on the backend server, which returns the results via a custom API built with FastAPI. At startup, the datasets and the model are downloaded to the file system using the 🤗 Datasets library. This makes startup time-consuming, but it allows for faster computation and response times during operation.
By calling various endpoints of the custom API, the frontend can receive the two-dimensional representation, and browse and retrieve individual data points from the datasets.
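A stripped-down sketch of what such a search endpoint could look like; the actual route names, parameters and startup logic in this repository may differ:

```python
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer, util

app = FastAPI()

# In the real app, the model and datasets are downloaded at startup;
# a tiny in-memory corpus stands in for them here.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus = ["first document", "second document", "third document"]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

@app.get("/search")
def search(query: str, top_k: int = 5):
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    # Convert scores to plain floats so they serialize cleanly to JSON.
    return {"results": [{"corpus_id": h["corpus_id"], "score": float(h["score"])} for h in hits]}
```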
Once the frontend receives the embeddings from the API, it maps them to the current screen size, as the original position values lie in a range between 0 and 1. The canvas library Konva is used to render the datasets.
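The mapping itself is simple scaling; sketched here in Python for brevity, though the frontend does this in the Svelte code:

```python
def to_screen(x: float, y: float, width: float, height: float) -> tuple[float, float]:
    """Scale a normalized [0, 1] position to canvas pixel coordinates."""
    return x * width, y * height
```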
By calling the search endpoint, the frontend can submit a search query and get the similar results back as a response. These are then mapped to the screen size and cached locally to make the search results persistent across reloads.
Docker must be installed and running to start the development setup. Instructions on how to install Docker for your OS can be found here.
To start a development setup run:
docker compose -f docker-compose.dev.yml up -d
This starts the frontend on http://localhost:8080 and the backend on http://localhost:7100.
Both have watch services enabled and reload automatically when changes are made to the source code.
To rebuild the development setup run:
docker compose -f docker-compose.dev.yml up -d --build
To rebuild only a specific container run:
docker compose -f docker-compose.dev.yml up -d --build frontend
The images for the frontend and backend are published and kept up to date on Docker Hub:
- francescosch/frontend_knowledge-spaces
- francescosch/backend_knowledge-spaces
To start the app for production, you can either use the provided docker-compose.yml or run the containers via docker commands.
Using the provided docker-compose.yml is the recommended way to start the app for production.
First copy the provided .env.example to .env and adjust the environment variables.
cp .env.example .env
A good default for the environment variables is:
# CORS
FRONTEND_URL=http://frontend:80
BACKEND_URL=http://backend:7100
If you host the app on any other URL or port, make sure to adjust the environment variables accordingly.
Use this command to start knowledge spaces.
docker compose up -d
You can also use the following docker commands to start the backend and frontend container. Make sure to adjust the environment variables and ports according to your needs.
docker network create knowledge-spaces
docker run \
--name backend_knowledge-spaces \
-e FRONTEND_URL=YOUR_FRONTEND_URL \
-v $(pwd)/data:/data \
--network knowledge-spaces \
francescosch/backend_knowledge-spaces:latest
docker run \
--name frontend_knowledge-spaces \
-e PUBLIC_BACKEND_URL=YOUR_BACKEND_URL \
-p 80:3000 \
--network knowledge-spaces \
francescosch/frontend_knowledge-spaces:latest
Note that docker run has no equivalent of Compose's depends_on, so start the backend container before the frontend.