discojs-core/models: add gpt #644

Merged
merged 7 commits into from
Mar 18, 2024

tharvik
Collaborator

@tharvik tharvik commented Feb 29, 2024

add GPT model, tracked in #641

this is a prototype whose output is not being tested (as we need tokenization to get meaning out of it); the only thing checked is that training the model reduces the loss

@tharvik tharvik self-assigned this Feb 29, 2024
@tharvik tharvik force-pushed the 641-partial-merge-llm-tharvik branch 3 times, most recently from bb51e16 to 10be5f1 Compare March 1, 2024 10:16
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from 25ff487 to bcddd72 Compare March 1, 2024 10:24
Base automatically changed from 641-partial-merge-llm-tharvik to develop March 1, 2024 14:35
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch 4 times, most recently from 682cc60 to c891efe Compare March 6, 2024 19:09
@tharvik tharvik changed the base branch from develop to 643-fixes-tharvik March 6, 2024 19:11
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch 4 times, most recently from b959e6b to 79c9878 Compare March 7, 2024 13:25
@tharvik tharvik force-pushed the 643-fixes-tharvik branch from a43e819 to 91fc143 Compare March 7, 2024 13:44
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from 79c9878 to ca2d2c1 Compare March 7, 2024 13:45
Base automatically changed from 643-fixes-tharvik to develop March 7, 2024 13:55
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from ca2d2c1 to 649a37c Compare March 7, 2024 13:56
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from 7a65e03 to aa39c7a Compare March 11, 2024 14:54
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from c9f6af1 to 1fb1b8a Compare March 13, 2024 14:17
@tharvik tharvik marked this pull request as ready for review March 13, 2024 14:27
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from 1fb1b8a to 83e3d9e Compare March 13, 2024 14:32
Member

@martinjaggi martinjaggi left a comment

amazing work, thanks!

i just left minor comments

discojs/discojs-core/src/default_tasks/wikitext.ts (outdated)
taskTitle: 'Wikitext 103 Raw',
summary: {
preview: 'In this challenge, we ask you to do next word prediction on a dataset of Wikipedia articles.',
overview: 'Wikitext-103-raw is a dataset comprising unprocessed text excerpts from Wikipedia articles, designed for tasks related to natural language processing and language modeling.'
Member

add a comment on what tokenizer is used (type, and pretrained tokenizer downloaded from ...)

Member

and also a comment on what to expect (e.g. if you train alone for 5 min or 5 h, roughly which train/test loss value you should expect, and that the resulting model will somehow sound English?)

Collaborator Author

there isn't really any tokenization happening: gpt-tokenizer can't be compiled in the webapp (probably due to the lack of #632). currently, it takes wikitext as a stream of characters and passes it through GPT.convertCharDataset to get the correct shape (I don't know what its purpose is, but it was required to make it work).

on the duration of training, I can only say that the loss is reducing. I hope that it'll write some correct English, but I have no test to prove it.

hopefully, #646 will help us make it work and generally try it out.
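
For readers unfamiliar with character-level training, here is a minimal sketch of the kind of reshaping described above. It is not the actual GPT.convertCharDataset code: the `charSamples` helper and the `CharSample` type are hypothetical, and using each UTF-16 code unit directly as a token id is an assumption made only for illustration. It cuts a text into windows of blockSize + 1 characters and derives the input and the shifted-by-one target from each window.

```typescript
// Hypothetical helper, NOT the actual GPT.convertCharDataset implementation.
// Builds (input, target) pairs for next-character prediction.

interface CharSample {
  xs: number[] // blockSize character codes fed to the model
  ys: number[] // the same window shifted by one character
}

function charSamples (text: string, blockSize: number): CharSample[] {
  // Assumption: each UTF-16 code unit is used directly as a "token" id.
  const codes = Array.from({ length: text.length }, (_, i) => text.charCodeAt(i))

  // One extra character per window so the target can be shifted by one.
  const sampleSize = blockSize + 1
  const out: CharSample[] = []

  // Non-overlapping windows; a trailing partial window is dropped.
  for (let start = 0; start + sampleSize <= codes.length; start += sampleSize) {
    const chunk = codes.slice(start, start + sampleSize)
    out.push({ xs: chunk.slice(0, blockSize), ys: chunk.slice(1) })
  }
  return out
}

const demo = charSamples('hello world', 4)
console.log(demo.length) // 2 complete windows: "hello" and " worl"
console.log(String.fromCharCode(...demo[0].xs)) // "hell"
console.log(String.fromCharCode(...demo[0].ys)) // "ello"
```

The extra character per window is why the quoted code uses `GPT.blockSize + 1` as its sample size.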

Collaborator

I think running gpt-micro on Shakespeare for 1 epoch should already produce some "shakespearesque" output

discojs/discojs-core/src/default_tasks/wikitext.ts (outdated)

private static readonly batchSize = 4
private static readonly blockSize = 128
private static readonly vocabSize = 50258
Member

comment somewhere on where the tokenizer came from? (from the task definition maybe?)

Collaborator

I wonder if the tokenizer should be part of the task definition or be abstracted away. Is there really a use case for Disco where one might change this parameter (or even know what a tokenizer is)?

private convertCharDataset (dataset: Dataset): Dataset {
const batchSize = 4
const sampleSize = GPT.blockSize + 1
const chunkSize = sampleSize * batchSize * 2
Member

comment that 2 bytes will be one token id? BTW, is the memory overhead much higher when using an array (or stream) of ints to represent the token stream input than with the 2 bytes here?

Collaborator

I am not sure if my code is wrong, but it is supposed to pull the exact number of bytes it needs for one batch. So chunkSize is the number of bytes in the read buffer, not the number of integers (4 bytes).
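
To make the byte arithmetic in this thread concrete, here is a small sketch under stated assumptions. The values of blockSize and batchSize come from the snippets quoted above; `chunkSizeInBytes` and `decodeUint16Tokens` are hypothetical helpers, and the little-endian byte order is an assumption, not something confirmed by the PR. It shows how the buffer size changes with the width of a token id and how 2-byte ids could be read back from a byte buffer.

```typescript
// Values taken from the snippets quoted in this review.
const blockSize = 128
const batchSize = 4
const sampleSize = blockSize + 1 // inputs plus one shifted target position

// Hypothetical helper: bytes needed to read exactly one batch,
// given how many bytes store a single token id.
function chunkSizeInBytes (bytesPerToken: number): number {
  return sampleSize * batchSize * bytesPerToken
}

// Hypothetical decoder: reads 16-bit token ids from a byte buffer.
// Assumption: ids are stored little-endian.
function decodeUint16Tokens (bytes: Uint8Array): number[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const tokens: number[] = []
  for (let offset = 0; offset + 2 <= bytes.byteLength; offset += 2) {
    tokens.push(view.getUint16(offset, true)) // true = little-endian
  }
  return tokens
}

console.log(chunkSizeInBytes(2)) // 1032 = 129 * 4 * 2
console.log(chunkSizeInBytes(4)) // 2064 = 129 * 4 * 4
console.log(decodeUint16Tokens(new Uint8Array([0x01, 0x00, 0x51, 0xc4]))) // [ 1, 50257 ]
```

A vocabulary of 50258 entries fits in 16 bits (maximum 65535), so 2 bytes per id are sufficient; a Uint16Array stores 2 bytes per element and an Int32Array 4.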

discojs/discojs-core/src/models/gpt/model.ts
@martinjaggi martinjaggi self-requested a review March 13, 2024 15:33
Collaborator

@JulienVig JulienVig left a comment

Great work, thank you Valérian! I left some questions and opinions, but you can proceed with the merge.

DEV.md (outdated)
datasets/README.md (outdated)
discojs/discojs-core/src/aggregator/base.ts
discojs/discojs-core/src/models/gpt/index.ts
server/src/router/federated/server.ts
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from 708d4e7 to 09331b3 Compare March 15, 2024 12:02
@tharvik tharvik force-pushed the 641-merge-prototype-llm-tharvik branch from 09331b3 to 23039d1 Compare March 18, 2024 12:10
@tharvik tharvik merged commit 1c0e914 into develop Mar 18, 2024
23 checks passed
@tharvik tharvik deleted the 641-merge-prototype-llm-tharvik branch March 18, 2024 14:14
5 participants