Model Card: Allow for dicts in datasets and base_model and also update spec #2479
Hi @mofosyne, thanks for raising the topic. Unfortunately, this is not an easy constraint to lift. It is not only a matter of type annotations but of server-side constraints. You can see it more as a "naming convention" rather than a hard technical constraint. The problem with lifting this limit is that we would have to update how we consume these fields in many places in the HF ecosystem. Also, since we guarantee specific types for model card metadata, third-party libraries and users rely on us not to break things over time. Supporting both dictionaries and lists for this field would unfortunately be a big breaking change. cc @julien-c
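To illustrate the kind of breakage, here is a hypothetical downstream consumer (the repo id is only an example): any code written against the documented `List[str]` shape fails the moment a dict appears in the list.

```python
# A minimal sketch of third-party code that relies on the current
# List[str] contract for `datasets`; a dict entry would break it.
from huggingface_hub import ModelCard

card = ModelCard.load("bert-base-uncased")  # example repo id
for dataset_id in card.data.datasets or []:
    # .lower() assumes str; a dict entry would raise AttributeError here
    print(dataset_id.lower())
```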
This is to address "Model Card: Allow for dicts in datasets and base_model and also update spec" in huggingface/huggingface_hub#2479, where we would like to add detailed metadata support for both base model and dataset, but in a way that Hugging Face will eventually be able to support (they currently accept either a string or a list of strings; we will be using a list of dicts, which is extensible). They recommended creating a separate metadata property for this.
Thanks. Merged in now. We will be sticking to these fields for the detailed dict representation.
So, something like this (note: dummy data provided by ChatGPT for illustrative purposes only):

```yaml
base_model_sources:
  - name: "GPT-3"
    author: "OpenAI"
    version: "3.0"
    organization: "OpenAI"
    description: "A large language model capable of performing a wide variety of language tasks."
    url: "https://openai.com/research/gpt-3"
    doi: "10.5555/gpt3doi123456"
    uuid: "123e4567-e89b-12d3-a456-426614174000"
    repo_url: "https://github.com/openai/gpt-3"
  - name: "BERT"
    author: "Google AI Language"
    version: "1.0"
    organization: "Google"
    description: "A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks."
    url: "https://github.com/google-research/bert"
    doi: "10.5555/bertdoi789012"
    uuid: "987e6543-e21a-43f3-a356-527614173999"
    repo_url: "https://github.com/google-research/bert"
dataset_sources:
  - name: "Wikipedia Corpus"
    author: "Wikimedia Foundation"
    version: "2021-06"
    organization: "Wikimedia"
    description: "A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks."
    url: "https://dumps.wikimedia.org/enwiki/"
    doi: "10.5555/wikidoi234567"
    uuid: "234e5678-f90a-12d3-c567-426614172345"
    repo_url: "https://github.com/wikimedia/wikipedia-corpus"
  - name: "Common Crawl"
    author: "Common Crawl Foundation"
    version: "2021-04"
    organization: "Common Crawl"
    description: "A dataset containing web-crawled data from various domains, providing a broad range of text."
    url: "https://commoncrawl.org"
    doi: "10.5555/ccdoi345678"
    uuid: "345e6789-f90b-34d5-d678-426614173456"
    repo_url: "https://github.com/commoncrawl/cc-crawl-data"
```

Will fill in these metadata fields in the GGUF key-value store.
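On the llama.cpp side, the mapping into the GGUF key-value store could look roughly like the sketch below. The `general.base_model.N.*` key pattern is assumed from ggerganov/llama.cpp#8875 and should be checked against the GGUF spec; header/tensor writing is omitted.

```python
# Rough sketch of writing base_model_sources into GGUF KV pairs; the
# general.base_model.N.* key names are assumed from llama.cpp#8875.
import yaml
from gguf import GGUFWriter

with open("model_card.yaml") as f:
    card = yaml.safe_load(f)

writer = GGUFWriter("model.gguf", arch="llama")
sources = card.get("base_model_sources", [])
writer.add_uint32("general.base_model.count", len(sources))
for i, src in enumerate(sources):
    for field in ("name", "author", "version", "organization",
                  "description", "url", "doi", "uuid", "repo_url"):
        if field in src:
            writer.add_string(f"general.base_model.{i}.{field}", str(src[field]))
```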
cool @mofosyne – thanks for linking ggerganov/llama.cpp#8875. Do you have models on the HF Hub using this convention already? We can add validation so the types are hinted to be correct, and we can monitor how usage grows. Let's track it!
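Something along these lines, perhaps (a rough sketch; the field set is taken from the examples above, and the Hub's actual server-side rules may differ):

```python
# Hypothetical validator for base_model_sources / dataset_sources entries;
# the allowed key set is assumed from the examples in this thread.
ALLOWED_KEYS = {"name", "author", "version", "organization", "description",
                "url", "doi", "uuid", "repo_url"}

def validate_sources(sources: list) -> None:
    if not isinstance(sources, list):
        raise TypeError(f"expected a list of dicts, got {type(sources).__name__}")
    for i, src in enumerate(sources):
        if not isinstance(src, dict):
            raise TypeError(f"entry {i} must be a dict, got {type(src).__name__}")
        if "name" not in src:
            raise ValueError(f"entry {i} is missing the required 'name' field")
        unknown = set(src) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"entry {i} has unexpected keys: {sorted(unknown)}")
```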
The feature hasn't been advertised anywhere at this stage... I will need to figure out the documentation next. But in the meantime, I'll also need to figure out the most canonical form that best fits your current model card parameters, because our model card parser is pretty forgiving of the various ways people enter their parameters. (Plus, at the time I didn't realize you had defined it here in the source code.) On studying your current code base, I noticed you used ... If so, then this is what I think your extended model card may look like. If you change ... If you are happy with the above, then I'll update the documentation to match, and you can sync to that when it gets popular.

```yaml
# Model Card Fields
model_name: Example Model Six
model_author: John Smith
model_version: v1.0
model_organization: SparkExampleMind
model_description: This is an example of a model
model_quantized_by: Abbety Jenson
# Useful for cleanly regenerating default naming conventions
model_finetune: instruct
model_basename: llamabase
model_size_label: 8x2.3Q
# Licensing details
license: apache-2.0
license_name: 'Apache License Version 2.0, January 2004'
license_link: 'https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md'
# Model Location/ID
model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16/blob/main/README.md'
model_doi: 'doi:10.1080/02626667.2018.1560449'
model_uuid: f18383df-ceb9-4ef3-b929-77e4dc64787c
model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16'
# Model Source If Conversion
source_model_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor/blob/main/README.md'
source_model_doi: 'doi:10.1080/02626667.2018.1560449'
source_model_uuid: 'a72998bf-3b84-4ff4-91c6-7a6b780507bc'
source_model_repo_url: 'https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor'
# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'
# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: 2021-06
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: 2021-04
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'
# Model Content Metadata
tags:
- text generation
- transformer
- llama
- tiny
- tiny model
pipeline_tag:
- text-classification
language:
- en
```

(Note: I also noticed that 'pipeline_tag' and 'language' are missing an 's' at the end... but that's a nitpick)
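As a quick sanity check that nothing in the existing tooling chokes on the extra fields, huggingface_hub's card parser already tolerates unknown keys; a small sketch (values are the dummy data from above):

```python
# Sketch: unknown front-matter keys round-trip through ModelCard,
# so base_model_sources can be carried today without breaking anything.
from huggingface_hub import ModelCard

text = """---
license: apache-2.0
base_model_sources:
  - name: GPT-3
    organization: OpenAI
---
# Example Model Six
"""
card = ModelCard(text)
print(card.data.base_model_sources)  # [{'name': 'GPT-3', 'organization': 'OpenAI'}]
```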
I feel that keeping ...
Also, this new proposition in #2479 (comment) adds far more fields to the model card than the suggestion in #2479 (comment). I think that adding ...
Ah, I see. So the parent references should have more details for easier retrieval, but the model itself can be understood from context. Fair enough. So if we don't dump out all the KV stuff, but just keep the fields directly referenced in the current HF model card conventions (as defined in the Python source), plus the detailed parent models/datasets, it should look more like this:

```yaml
# Model Card Fields
model_name: Example Model Six
# Licensing details
license: apache-2.0
license_name: Apache License Version 2.0, January 2004
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
# Model Parents (Merges, Pre-tuning, etc...)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'
# Model Datasets Used (Training data...)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: '2021-06'
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: '2021-04'
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'
# Model Content Metadata
tags:
- text generation
- transformer
- llama
- tiny
- tiny model
language:
- en
#... other HF stuff here... but it isn't present in the GGUF KV store...
#... so there is no direct analog, and it will be omitted on the llama.cpp side of the documentation...
```

Well @Wauplin, this does indeed look a bit more compact now. FYI, this is just going to be documentation on our side for now. But just double-checking that we won't be stepping on any toes. Thumbs up if all green. (edit: removed pipeline_tag as I remembered it's not included in the GGUF)
Nice, I can confirm that this version is not stepping on anyone's toes! 👍
Proposal looks good to me, but I would, whenever possible, also include our simpler ..., i.e. I would use the current proposal as an extension/add-on on top of the existing conventional (simpler) metadata.
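In other words, something like the following sketch, where the simple, existing convention is kept alongside the detailed add-on (the repo ids below are illustrative, not verified):

```python
# Sketch: conventional base_model entries coexist with the detailed
# base_model_sources extension, so existing consumers keep working.
from huggingface_hub import ModelCardData

data = ModelCardData(
    base_model=["openai/gpt-3", "google-bert/bert-base-uncased"],  # existing simple field
    base_model_sources=[                                           # detailed extension
        {"name": "GPT-3", "organization": "OpenAI"},
        {"name": "BERT", "organization": "Google AI Language"},
    ],
)
print(data.to_yaml())  # emits both the simple and the detailed fields
```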
Okay, thanks. FYI, I've placed the mapping at https://github.com/ggerganov/llama.cpp/wiki/HuggingFace-Model-Card-Metadata-Interoperability-Consideration for future reference.
**Is your feature request related to a problem? Please describe.**
Was working on ggerganov/llama.cpp#8875 to integrate some changes to how we interpret parent models and datasets into GGUF metadata, and was alerted that your code currently interprets `datasets` as only `List[str]`, while the changes we are proposing would support these types in `datasets` and `base_model`:

- `List[str]` of Hugging Face ids
- `List[str]` of URLs to other repos
- `List[dict]` of dicts with fields like name, author, version, organization, url, doi, uuid and repo_url

**Describe the solution you'd like**
Update the description to indicate support for URLs and dict metadata in both the `datasets` and `base_model` entries in the model card, as well as update the typechecks to support dict as an option.

**Describe alternatives you've considered**
We can already support this extra metadata in the GGUF file format via metadata override files, but it would be nice to sync these features so we can more easily grab this information from the model creator's model card.
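For context, a hedged sketch of what such an override file could contain; the key names are assumed from llama.cpp's `gguf-py/gguf/metadata.py` and the `--metadata` flag of the convert script, so verify them against the source before relying on this.

```python
# Hypothetical metadata override for llama.cpp's converter (passed via
# --metadata override.json); key names are assumed, not verified.
import json

override = {
    "general.name": "Example Model Six",
    "general.author": "John Smith",
    "general.base_models": [
        {"name": "GPT-3", "organization": "OpenAI",
         "repo_url": "https://github.com/openai/gpt-3"},
    ],
    "general.datasets": [
        {"name": "Wikipedia Corpus", "organization": "Wikimedia"},
    ],
}

with open("override.json", "w") as f:
    json.dump(override, f, indent=2)
```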
**Additional context**
The code area I'm looking at is `src/huggingface_hub/repocard_data.py`, lines 249 to 251 (at commit e9cd695).
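For readers without the repo open, the referenced region concerns the type contract for these fields; roughly (a paraphrase of the shape, not the exact lines at e9cd695):

```python
# Paraphrased shape of ModelCardData's relevant attributes (assumed,
# not copied from the pinned commit): both fields are string-only today.
from typing import List, Optional, Union

datasets: Optional[List[str]] = None                # e.g. ["wikipedia", "c4"]
base_model: Optional[Union[str, List[str]]] = None  # Hub id(s) of the parent model
```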