---
authors: [wwayne, gyxlucy, meng]

tags: [tech design]
---

# Stream laziness in Tabby

This blog focuses on understanding stream laziness in Tabby. You do not need to know this information to use Tabby, but for those interested, it offers a deeper dive into why and how Tabby handles its LLM workload.

## What is streaming?

Let's begin by setting up a simple example program:

![intro](./intro.png)

```javascript
const express = require('express');

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function* llm() {
  let i = 1;
  while (true) {
    console.log(`producing ${i}`);
    yield i++;

    // Mimic LLM inference latency.
    await sleep(1000);
  }
}

function server(llm) {
  const app = express();
  app.get('/', async (req, res) => {
    res.writeHead(200, {
      'Content-Type': 'application/jsonstream',
      'Transfer-Encoding': 'chunked',
    });

    let value, done;
    do {
      ({ value, done } = await llm.next());
      res.write(JSON.stringify(value));
      res.write('\n');
    } while (!done);
  });

  app.listen(8080);
}

async function consumer() {
  const resp = await fetch('http://localhost:8080');

  // Read values from our stream
  const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();
  // We're only reading 3 items this time:
  for (let i = 0; i < 3; i++) {
    // we know our stream is infinite, so there's no need to check `done`.
    const { value } = await reader.read();
    console.log(`read ${value}`);
  }
}

server(llm());
consumer();
```

## Stream Laziness

If you run this program, you'll notice something interesting: the LLM keeps printing `producing ${i}` even after the consumer has finished its three reads. This might seem obvious, given that the LLM generates an infinite stream of integers, but it points to a real problem: our server must maintain an ever-expanding queue of items that have been pushed in but not pulled out.

Moreover, the work involved in producing the stream is typically both expensive and time-consuming, such as computation on the GPU. But what if the client aborts the in-flight request, whether because of a network issue or deliberately?

This is where the concept of stream laziness comes into play. We should perform computations only when the client requests them. If the client no longer needs a response, we should halt production and pause the stream, thereby saving valuable GPU resources.

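To make the idea concrete, here is a minimal standalone sketch (not code from Tabby) of lazy, pull-based consumption with an async generator: the generator body runs only when the consumer asks for the next value, and leaving the loop stops production entirely.

```js
// A minimal sketch of lazy, pull-based consumption (illustrative only).
async function* lazyLlm() {
  let i = 1;
  while (true) {
    console.log(`producing ${i}`); // Runs only when the consumer requests a value.
    yield i++;
  }
}

async function main() {
  const stream = lazyLlm();
  for await (const value of stream) {
    console.log(`read ${value}`);
    // Breaking out of a for await...of loop calls stream.return(), which
    // finishes the generator, so nothing more is produced after this point.
    if (value >= 3) break;
  }
}

main();
```
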
![Cancellation](./cancellation.png)

## How to handle cancellation?

The core idea is straightforward: on the server side, we need to listen to the `close` event and check if the connection is still valid before pulling data from the LLM stream.

```js
app.get('/', async (req, res) => {
  ...
  let canceled;
  req.on('close', () => canceled = true);
  do {
    ({ value, done } = await llm.next());
    ...
  } while (!done && !canceled);
});
```

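Combining this with the earlier example, the full handler might look roughly like the sketch below (illustrative, not the exact Tabby implementation): the `close` listener flips a flag, and the loop stops pulling from the LLM stream once the flag is set.

```js
// A sketch that merges the two snippets above: the same Express handler,
// with cancellation wired in (assumes the `express` import and the infinite
// `llm()` generator from the first example).
function server(llm) {
  const app = express();
  app.get('/', async (req, res) => {
    res.writeHead(200, {
      'Content-Type': 'application/jsonstream',
      'Transfer-Encoding': 'chunked',
    });

    // Stop pulling from the LLM stream once the client goes away.
    let canceled = false;
    req.on('close', () => { canceled = true; });

    let value, done;
    do {
      ({ value, done } = await llm.next());
      res.write(JSON.stringify(value));
      res.write('\n');
    } while (!done && !canceled);

    res.end();
  });

  app.listen(8080);
}
```
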
## Implement cancellation for Tabby

In Tabby, managing code completion cancellations effectively is crucial for responding promptly to user input while optimizing model usage for better performance.

On the client side, whenever we receive a new input from a user, it's essential to abort the previous query and promptly retrieve a new response from the server.

```js
// Demo code on the client side

let controller;

const callServer = async (prompt) => {
  controller = new AbortController();
  const signal = controller.signal;
  // 2. Call the server API to get the result for the prompt,
  //    passing the abort signal along with the request.
  const response = await fetch("/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ prompt }),
    signal
  });
};

const onChange = (e) => {
  if (controller) controller.abort(); // Abort the previous request
  callServer(e.target.value);
};

// 1. Debounce the input, e.g. by 100ms
<input onChange={debounce(onChange, 100)} />
```

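The snippet assumes a `debounce` helper is available (lodash's `debounce`, for instance, would do). For completeness, a minimal hand-rolled version might look like this:

```js
// A minimal debounce helper (illustrative): waits `delay` ms after the last
// call before invoking `fn`, so rapid keystrokes trigger only one request.
function debounce(fn, delay = 100) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delay);
  };
}
```
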
By employing streaming and implementing laziness semantics appropriately, all components operate smoothly and efficiently!

![Streaming](./stream.png)

## That's it

We would love to invite you to join our Slack community! Please feel free to reach out to us on [Slack](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) - we have channels for discussing all aspects of the product and tech, and everyone is welcome to join the conversation.

Happy hacking 😁💪🏻