
Model Card: Allow for dicts in datasets and base_model and also update spec #2479

mofosyne opened this issue Aug 22, 2024 · 11 comments

mofosyne commented Aug 22, 2024

Is your feature request related to a problem? Please describe.

I was working on ggerganov/llama.cpp#8875 to integrate some changes to how we map parent models and datasets into GGUF metadata, and was alerted that your code currently interprets datasets as only List[str], while the changes we are proposing would support these types in datasets and base_model:

  • List[str] of Hugging Face IDs
  • List[str] of URLs to other repos
  • List[dict] of dicts with fields like name, author, version, organization, url, doi, uuid and repo_url

Describe the solution you'd like

Update the description to indicate support for URLs and dict metadata in both the datasets and base_model entries of the model card, and update the type checks to accept dict as an option.

Describe alternatives you've considered

We can already supply this extra metadata in the GGUF file format via metadata override files, but it would be nice to sync these features so we can more easily pull this information from the model creator's model card.

Additional context

The code area I'm looking at is:

datasets (`List[str]`, *optional*):
List of datasets that were used to train this model. Should be a dataset ID
found on https://hf.co/datasets. Defaults to None.
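For illustration, a minimal sketch of how that annotation could be widened while keeping existing plain strings working (an assumption about the shape, not the actual huggingface_hub implementation):

from typing import Dict, List, Optional, Union

# Each entry is either a plain string (a Hub ID or a URL) or a dict with
# optional fields such as name, author, version, organization, url, doi,
# uuid and repo_url.
SourceEntry = Union[str, Dict[str, str]]

datasets: Optional[List[SourceEntry]] = None
base_model: Optional[List[SourceEntry]] = None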

Wauplin (Contributor) commented Aug 27, 2024

Hi @mofosyne, thanks for raising the topic. Unfortunately, this is not an easy constraint to lift. It is not only a matter of type annotations but of server-side constraints. You can see it more as a "naming convention" than a hard technical constraint. The problem with lifting this limit is that we would have to update how we consume these fields in many places in the HF ecosystem. Also, since we guarantee specific types for model card metadata, third-party libraries and users rely on us not to break things over time. Supporting both dictionaries and lists for this field would unfortunately be a big breaking change.
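To make the compatibility risk concrete, downstream code written against the List[str] guarantee breaks as soon as dict entries appear. An illustrative sketch with made-up values, not actual ecosystem code:

# Suppose a card's `datasets` field were allowed to mix strings and dicts:
datasets = ["squad", {"name": "Wikipedia Corpus"}]

# Third-party code that assumes List[str] then fails:
dataset_ids = [d.lower() for d in datasets]
# AttributeError: 'dict' object has no attribute 'lower'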

cc @julien-c

julien-c (Member) commented Sep 2, 2024

Yes, I agree with @Wauplin. For your use case, @mofosyne, you could add your own metadata property, no? (And we can even add built-in support for it if a standard emerges.)
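For instance, huggingface_hub's ModelCardData accepts arbitrary extra kwargs and serializes them alongside the built-in fields, so a custom property needs no server-side change. A minimal sketch with illustrative values:

from huggingface_hub import ModelCardData

card_data = ModelCardData(
    license="apache-2.0",
    # Custom property, not a built-in field; it is kept in the card's
    # YAML metadata next to the conventional keys.
    base_model_sources=[
        {"name": "GPT-3", "organization": "OpenAI", "repo_url": "https://github.com/openai/gpt-3"},
    ],
)
print(card_data.to_yaml())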

mofosyne added a commit to mofosyne/llama.cpp that referenced this issue Oct 7, 2024
This is to address "Model Card: Allow for dicts in datasets and base_model and also update spec" in huggingface/huggingface_hub#2479, where we would like to add detailed metadata support for both base model and dataset, but in a way that Hugging Face will eventually be able to support (they are currently using either a string or a string list; we will be using a list of dicts, which is extensible). They recommended creating a separate metadata property for this.
mofosyne (Author) commented Nov 13, 2024

Thanks. This is merged in now. We will be sticking to these fields for the detailed dict representation:

  • base_model_sources (List[dict], optional)
  • dataset_sources (List[dict], optional)

Hence something like this (note: dummy data provided by ChatGPT for illustrative purposes only):

base_model_sources:
  - name: "GPT-3"
    author: "OpenAI"
    version: "3.0"
    organization: "OpenAI"
    description: "A large language model capable of performing a wide variety of language tasks."
    url: "https://openai.com/research/gpt-3"
    doi: "10.5555/gpt3doi123456"
    uuid: "123e4567-e89b-12d3-a456-426614174000"
    repo_url: "https://github.com/openai/gpt-3"

  - name: "BERT"
    author: "Google AI Language"
    version: "1.0"
    organization: "Google"
    description: "A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks."
    url: "https://github.com/google-research/bert"
    doi: "10.5555/bertdoi789012"
    uuid: "987e6543-e21a-43f3-a356-527614173999"
    repo_url: "https://github.com/google-research/bert"

dataset_sources:
  - name: "Wikipedia Corpus"
    author: "Wikimedia Foundation"
    version: "2021-06"
    organization: "Wikimedia"
    description: "A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks."
    url: "https://dumps.wikimedia.org/enwiki/"
    doi: "10.5555/wikidoi234567"
    uuid: "234e5678-f90a-12d3-c567-426614172345"
    repo_url: "https://github.com/wikimedia/wikipedia-corpus"

  - name: "Common Crawl"
    author: "Common Crawl Foundation"
    version: "2021-04"
    organization: "Common Crawl"
    description: "A dataset containing web-crawled data from various domains, providing a broad range of text."
    url: "https://commoncrawl.org"
    doi: "10.5555/ccdoi345678"
    uuid: "345e6789-f90b-34d5-d678-426614173456"
    repo_url: "https://github.com/commoncrawl/cc-crawl-data"

These will fill in the following metadata fields in the GGUF key-value store:

general.base_model.count
general.base_model.{id}.name
general.base_model.{id}.author
general.base_model.{id}.version
general.base_model.{id}.organization
general.base_model.{id}.description
general.base_model.{id}.url
general.base_model.{id}.doi
general.base_model.{id}.uuid
general.base_model.{id}.repo_url

general.dataset.count
general.dataset.{id}.name
general.dataset.{id}.author
general.dataset.{id}.version
general.dataset.{id}.organization
general.dataset.{id}.description
general.dataset.{id}.url
general.dataset.{id}.doi
general.dataset.{id}.uuid
general.dataset.{id}.repo_url
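Not llama.cpp's actual code, but a minimal sketch of how the YAML source lists above could be flattened into those dotted keys (hypothetical helper for illustration):

def flatten_sources(sources: list[dict], prefix: str) -> dict:
    """Flatten a list of source dicts into GGUF-style dotted KV pairs."""
    kv = {f"{prefix}.count": len(sources)}
    for i, src in enumerate(sources):
        for field, value in src.items():
            kv[f"{prefix}.{i}.{field}"] = value
    return kv

# flatten_sources(dataset_sources, "general.dataset") would yield e.g.
# {"general.dataset.count": 2, "general.dataset.0.name": "Wikipedia Corpus", ...}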

julien-c (Member) commented:

Cool @mofosyne – thanks for linking ggerganov/llama.cpp#8875.

Do you have models on the HF Hub using this convention already? We can add validation so the types are hinted to be correct. Let's track how usage grows!

mofosyne (Author) commented Nov 18, 2024

The feature hasn't been advertised anywhere at this stage... I will need to figure out the documentation next.

But in the meantime, I'll also need to figure out the most canonical form that best fits your current model card parameters, because our model card parser is pretty forgiving of the various ways people enter their parameters. (Plus, at the time I didn't realize you had defined them in the source code.)

On studying your current code base, I noticed you used model_name rather than name as I would have expected. So I prefixed most of the parameters with model_*, except for license, tags, pipeline_tag and language, to keep with the same pattern.

If so, then this is what I think your extended model card may look like. If you change model_name to name on your side, then it would make sense to remove the model_* parameter pattern. But either way works for me.

If you are happy with the above, then I'll update the documentation to match and you can sync to that when it gets popular.

# Model Card Fields
model_name: Example Model Six
model_author: John Smith
model_version: v1.0
model_organization: SparkExampleMind
model_description: This is an example of a model
model_quantized_by: Abbety Jenson
# Useful for cleanly regenerating default naming conventions
model_finetune: instruct
model_basename: llamabase
model_size_label: 8x2.3Q
# Licensing details
license: apache-2.0
license_name: 'Apache License Version 2.0, January 2004'
license_link: 'https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md'
# Model Location/ID
model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16/blob/main/README.md'
model_doi: 'doi:10.1080/02626667.2018.1560449'
model_uuid: f18383df-ceb9-4ef3-b929-77e4dc64787c
model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16'
# Model Source If Conversion
source_model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor/blob/main/README.md'
source_model_doi: 'doi:10.1080/02626667.2018.1560449'
source_model_uuid: 'a72998bf-3b84-4ff4-91c6-7a6b780507bc'
source_model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor'
# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'
# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: 2021-06
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: 2021-04
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'
# Model Content Metadata
tags:
  - text generation
  - transformer
  - llama
  - tiny
  - tiny model
pipeline_tag:
  - text-classification
language:
  - en

(Note: I also noticed that 'pipeline_tag' and 'language' are missing an 's' at the end... but that's a nitpick.)
(P.S. An idea to consider is to give brownie awards to repos with good metadata.)
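To illustrate the forgiving-parser point above, a minimal sketch of how a reader could normalize such a field whether the creator wrote a bare string, a list of strings, or a list of dicts (a hypothetical helper, not llama.cpp's actual parser):

import yaml

def get_source_list(frontmatter: str, key: str) -> list:
    """Return `key` as a list of dicts, tolerating a bare string,
    a single dict, a list of strings, or a list of dicts."""
    data = yaml.safe_load(frontmatter) or {}
    raw = data.get(key) or []
    if isinstance(raw, (str, dict)):
        raw = [raw]
    return [{"name": e} if isinstance(e, str) else e for e in raw]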

Wauplin (Contributor) commented Nov 18, 2024

I feel that keeping model_name for consistency with the existing convention, but leaving all the other fields "raw" (author, version, organization, etc.), is better. I find the model_* prefix everywhere very verbose. Similarly, I'd keep dataset_name but not prepend dataset_* everywhere.

Wauplin (Contributor) commented Nov 18, 2024

Also, this new proposition in #2479 (comment) adds far more fields to the model card than the suggestion in #2479 (comment). I think that adding base_model_sources and dataset_sources with defined specifications is fine, but adding all the other fields (source_model_url, source_model_doi, model_doi, model_uuid, model_quantized_by, model_finetune, model_organization, etc.) is too much and would bloat the model card metadata convention.

mofosyne (Author) commented Nov 19, 2024

Ah I see. So the parent references should have more details for easier retrieval, but the model itself can be understood by context. Fair enough.

So if we don't dump out all the KV fields, but just keep those directly referenced in the current HF model card conventions (as defined in the Python source), plus the detailed parent models/datasets, this should look more like:

# Model Card Fields
model_name: Example Model Six
# Licensing details
license: apache-2.0
license_name: Apache License Version 2.0, January 2004
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'
# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: '2021-06'
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: '2021-04'
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'
# Model Content Metadata
tags:
  - text generation
  - transformer
  - llama
  - tiny
  - tiny model
language:
  - en

# ... other HF fields would go here, but they aren't present in the GGUF KV store ...
# ... so there is no direct analog and they will be omitted on the llama.cpp side of the documentation ...

Well @Wauplin, this does indeed look a bit more compact now. FYI, this is just going to be documentation on our side for now, but I'm double-checking that we won't be stepping on any toes. Thumbs up if all green.

(edit: removed pipeline_tag as I remembered it's not included in the GGUF)

Wauplin (Contributor) commented Nov 19, 2024

Nice, I can confirm that this version is not stepping on anyone's toes! 👍
Gentle ping to @ggerganov and @julien-c if you want to confirm that the metadata described above makes sense to you as well, so we can settle this for good.

julien-c (Member) commented:

The proposal looks good to me, but I would, whenever possible, also include our simpler base_model (array of model IDs on the Hub) and datasets (array of dataset IDs on the Hub) – whenever you know them – as we already have more built-in support for those.

I.e., I would use the current proposal as an extension/add-on on top of the existing conventional (simpler) metadata.
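A sketch of what that layering could look like via huggingface_hub, keeping the simple built-in fields alongside the richer custom extension (the IDs and values below are illustrative only):

from huggingface_hub import ModelCardData

card_data = ModelCardData(
    base_model=["openai-community/gpt2"],  # built-in: Hub model IDs
    datasets=["wikimedia/wikipedia"],      # built-in: Hub dataset IDs
    # Extension on top of the simple convention (custom properties):
    base_model_sources=[{"name": "GPT-2", "organization": "OpenAI"}],
    dataset_sources=[{"name": "Wikipedia Corpus", "author": "Wikimedia Foundation"}],
)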

mofosyne (Author) commented:

Okay, thanks. FYI, I've placed the mapping at https://github.com/ggerganov/llama.cpp/wiki/HuggingFace-Model-Card-Metadata-Interoperability-Consideration for future reference.
