From b04604bf73c8b65cca1f6cecf7dd741691b8bfa0 Mon Sep 17 00:00:00 2001 From: "nidhin.devan" Date: Thu, 7 Nov 2024 19:17:59 +0530 Subject: [PATCH] Constrained generation with gemma notebook added, README updated --- Gemma/Constrained_generation_with_Gemma.ipynb | 710 ++++++++++++++++++ README.md | 3 +- 2 files changed, 712 insertions(+), 1 deletion(-) create mode 100644 Gemma/Constrained_generation_with_Gemma.ipynb diff --git a/Gemma/Constrained_generation_with_Gemma.ipynb b/Gemma/Constrained_generation_with_Gemma.ipynb new file mode 100644 index 0000000..39d3803 --- /dev/null +++ b/Gemma/Constrained_generation_with_Gemma.ipynb @@ -0,0 +1,710 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "3DNHhOsQX4Rg" + }, + "source": [ + "##### Copyright 2024 Google LLC." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "cellView": "form", + "id": "_bO8_SJzX4t6" + }, + "outputs": [], + "source": [ + "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ABzryOPIYE2O" + }, + "source": [ + "# Getting Started with Constrained generation with Gemma 2 using Llamacpp and Guidance\n", + "\n", + "[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open-source language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.\n", + "Gemma models are well-suited for various text-generation tasks, including question-answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.\n", + "\n", + "Constrained generation is a method that modifies the token generation process of a generative model to limit its predictions for subsequent tokens to only those that adhere to the necessary output structure.\n", + "\n", + "[llama.cpp](https://github.com/ggerganov/llama.cpp) is a C++ implementation of Meta AI's LLaMA and other large language model architectures, designed for efficient performance on local machines or within environments like Google Colab. It enables you to run large language models without needing extensive computational resources. In llama.cpp, formal grammars are defined using the GBNF (GGML BNF) format to constrain model outputs. It can be used, for instance, to make the model produce legitimate JSON or to communicate exclusively in emojis.\n", + "\n", + "[Guidance](https://github.com/guidance-ai/guidance/tree/main?tab=readme-ov-file#constrained-generation) is an effective programming paradigm for steering language models. Guidance reduces latency and costs compared to traditional prompting or fine-tuning while allowing you to control the output's structure and provide high-quality output for your use case.\n", + "\n", + "In this notebook, you will learn how to perform constrained generation in Gemma 2 models using `llama.cpp` and `guidance` in a Google Colab environment. You'll install the necessary packages, set up the model, and run a sample prompt.\n", + "\n", + "\n", + "\n", + "
\n", + " Run in Google Colab\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6TDJ3j0rh-Uw" + }, + "source": [ + "## Setup\n", + "\n", + "### Select the Colab runtime\n", + "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n", + "\n", + "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n", + "2. Select **Change runtime type**.\n", + "3. Under **Hardware accelerator**, select **T4 GPU**.\n", + "\n", + "### Gemma setup\n", + "\n", + "**Before you dive into the tutorial, let's get you set up with Gemma:**\n", + "\n", + "1. **Hugging Face Account:** If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).\n", + "2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.\n", + "3. **Colab with Gemma Power:** For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when starting your Colab session.\n", + "4. **Hugging Face Token:** Generate a Hugging Face access (preferably `write` permission) token by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.\n", + "\n", + "**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7tvumI5CiEeG" + }, + "source": [ + "### Configure your HF token\n", + "\n", + "Add your Hugging Face token to the Colab Secrets manager to securely store it.\n", + "\n", + "1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. \"The\n", + "2. Create a new secret with the name `HF_TOKEN`.\n", + "3. Copy/paste your token key into the Value input box of `HF_TOKEN`.\n", + "4. Toggle the button on the left to allow notebook access to the secret." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "lpXvNz0HeKc1" + }, + "outputs": [], + "source": [ + "import os\n", + "from google.colab import userdata\n", + "\n", + "# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env\n", + "# vars as appropriate for your system.\n", + "os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fxfWSVsaiFQK" + }, + "source": [ + "### Install dependencies\n", + "\n", + "You'll need to install a few Python packages and dependencies to interact with HuggingFace along with `llama-cpp-python` and `guidance`. Find some of the releases of `llama-cpp-python` supporting CUDA 12.2 [here](https://abetlen.github.io/llama-cpp-python/whl/cu122/llama-cpp-python/).\n", + "\n", + "Run the following cell to install or upgrade it:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "Xm9wxRQ_eUkC" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/447.5 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m143.4/447.5 kB\u001b[0m \u001b[31m8.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m \u001b[32m440.3/447.5 kB\u001b[0m \u001b[31m11.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m447.5/447.5 kB\u001b[0m \u001b[31m6.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting guidance\n", + " Downloading guidance-0.1.16-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.7 kB)\n", + "Collecting diskcache (from guidance)\n", + " Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from guidance) (1.26.4)\n", + "Collecting ordered-set (from guidance)\n", + " Downloading ordered_set-4.1.0-py3-none-any.whl.metadata (5.3 kB)\n", + "Requirement already satisfied: platformdirs in /usr/local/lib/python3.10/dist-packages (from guidance) (4.3.6)\n", + "Requirement already satisfied: protobuf in /usr/local/lib/python3.10/dist-packages (from guidance) (3.20.3)\n", + "Requirement already satisfied: pydantic in /usr/local/lib/python3.10/dist-packages (from guidance) (2.9.2)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from guidance) (2.32.3)\n", + "Collecting tiktoken>=0.3 (from guidance)\n", + " Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n", + "Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken>=0.3->guidance) (2024.9.11)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (3.4.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (2.2.3)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->guidance) (2024.8.30)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic->guidance) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.10/dist-packages (from pydantic->guidance) (2.23.4)\n", + "Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic->guidance) (4.12.2)\n", + "Downloading guidance-0.1.16-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (255 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m255.4/255.4 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.2/1.2 MB\u001b[0m \u001b[31m37.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading diskcache-5.6.3-py3-none-any.whl (45 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m45.5/45.5 kB\u001b[0m \u001b[31m3.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading ordered_set-4.1.0-py3-none-any.whl (7.6 kB)\n", + "Installing collected packages: ordered-set, diskcache, tiktoken, guidance\n", + "Successfully installed diskcache-5.6.3 guidance-0.1.16 ordered-set-4.1.0 tiktoken-0.8.0\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m443.8/443.8 MB\u001b[0m \u001b[31m4.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h" + ] + } + ], + "source": [ + "# The huggingface_hub library allows us to download models and other files from Hugging Face.\n", + "!pip install --upgrade -q huggingface_hub\n", + "\n", + "# Install the guidance package.\n", + "!pip install guidance\n", + "\n", + "# The llama-cpp-python library allows us to leverage GPUs.\n", + "!pip install llama-cpp-python==0.2.90 \\\n", + " -q -U --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CtCpooOBipDy" + }, + "source": [ + "### Logging into Hugging Face Hub\n", + "\n", + "Next, you’ll need to log into the Hugging Face Hub using your access token to download the Gemma model." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "CzMQZ1SReagV" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\n", + "WARNING:huggingface_hub._login:Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\n" + ] + } + ], + "source": [ + "from huggingface_hub import login\n", + "\n", + "login(os.environ[\"HF_TOKEN\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8ZjH8VRZismE" + }, + "source": [ + "### Downloading the Gemma 2 Model\n", + "Once you're logged in, you can download the Gemma 2 model files from Hugging Face. The [Gemma 2 model](https://huggingface.co/google/gemma-2-2b-GGUF) is available in **GGUF** format, which is optimized for use with `llama.cpp` and compatible tools like Llamafile." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "K2QM6BO3ebUY" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "15433f2b900b4034ae81f61a5780efd5", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "2b_pt_v2.gguf: 0%| | 0.00/10.5G [00:00\", \"\", \"\", \"\", ...\n", + "llama_model_loader: - kv 19: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...\n", + "llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...\n", + "llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 2\n", + "llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 1\n", + "llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 3\n", + "llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 0\n", + "llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true\n", + "llama_model_loader: - kv 26: tokenizer.ggml.add_eos_token bool = false\n", + "llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false\n", + "llama_model_loader: - kv 28: general.quantization_version u32 = 2\n", + "llama_model_loader: - type f32: 288 tensors\n", + "llm_load_vocab: special tokens cache size = 249\n", + "llm_load_vocab: token to piece cache size = 1.6014 MB\n", + "llm_load_print_meta: format = GGUF V3 (latest)\n", + "llm_load_print_meta: arch = gemma2\n", + "llm_load_print_meta: vocab type = SPM\n", + "llm_load_print_meta: n_vocab = 256000\n", + "llm_load_print_meta: n_merges = 0\n", + "llm_load_print_meta: vocab_only = 0\n", + "llm_load_print_meta: n_ctx_train = 8192\n", + "llm_load_print_meta: n_embd = 2304\n", + "llm_load_print_meta: n_layer = 26\n", + "llm_load_print_meta: n_head = 8\n", + "llm_load_print_meta: n_head_kv = 4\n", + "llm_load_print_meta: n_rot = 256\n", + "llm_load_print_meta: n_swa = 4096\n", + "llm_load_print_meta: n_embd_head_k = 256\n", + "llm_load_print_meta: n_embd_head_v = 256\n", + "llm_load_print_meta: n_gqa = 2\n", + "llm_load_print_meta: n_embd_k_gqa = 1024\n", + "llm_load_print_meta: n_embd_v_gqa = 1024\n", + "llm_load_print_meta: f_norm_eps = 0.0e+00\n", + "llm_load_print_meta: f_norm_rms_eps = 1.0e-06\n", + "llm_load_print_meta: f_clamp_kqv = 0.0e+00\n", + "llm_load_print_meta: f_max_alibi_bias = 0.0e+00\n", + "llm_load_print_meta: f_logit_scale = 0.0e+00\n", + "llm_load_print_meta: n_ff = 9216\n", + "llm_load_print_meta: n_expert = 0\n", + "llm_load_print_meta: n_expert_used = 0\n", + "llm_load_print_meta: causal attn = 1\n", + "llm_load_print_meta: pooling type = 0\n", + "llm_load_print_meta: rope type = 2\n", + "llm_load_print_meta: rope scaling = linear\n", + "llm_load_print_meta: freq_base_train = 10000.0\n", + "llm_load_print_meta: freq_scale_train = 1\n", + "llm_load_print_meta: n_ctx_orig_yarn = 8192\n", + "llm_load_print_meta: rope_finetuned = unknown\n", + "llm_load_print_meta: ssm_d_conv = 0\n", + "llm_load_print_meta: ssm_d_inner = 0\n", + "llm_load_print_meta: ssm_d_state = 0\n", + "llm_load_print_meta: ssm_dt_rank = 0\n", + "llm_load_print_meta: ssm_dt_b_c_rms = 0\n", + "llm_load_print_meta: model type = 2B\n", + "llm_load_print_meta: model ftype = all F32\n", + "llm_load_print_meta: model params = 2.61 B\n", + "llm_load_print_meta: model size = 9.74 GiB (32.00 BPW) \n", + "llm_load_print_meta: general.name = ff8948d2ca54b23c93d253533c6effcf2e892347\n", + "llm_load_print_meta: BOS token = 2 ''\n", + "llm_load_print_meta: EOS token = 1 ''\n", + "llm_load_print_meta: UNK token = 3 ''\n", + "llm_load_print_meta: PAD token = 0 ''\n", + "llm_load_print_meta: LF token = 227 '<0x0A>'\n", + "llm_load_print_meta: EOT token = 107 ''\n", + "llm_load_print_meta: max token length = 48\n", + "llm_load_tensors: ggml ctx size = 0.13 MiB\n", + "llm_load_tensors: offloading 0 repeating layers to GPU\n", + "llm_load_tensors: offloaded 0/27 layers to GPU\n", + "llm_load_tensors: CPU buffer size = 9972.92 MiB\n", + "..................................................................\n", + "llama_new_context_with_model: n_ctx = 512\n", + "llama_new_context_with_model: n_batch = 512\n", + "llama_new_context_with_model: n_ubatch = 512\n", + "llama_new_context_with_model: flash_attn = 0\n", + "llama_new_context_with_model: freq_base = 10000.0\n", + "llama_new_context_with_model: freq_scale = 1\n", + "llama_kv_cache_init: CUDA_Host KV buffer size = 52.00 MiB\n", + "llama_new_context_with_model: KV self size = 52.00 MiB, K (f16): 26.00 MiB, V (f16): 26.00 MiB\n", + "llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB\n", + "llama_new_context_with_model: CUDA0 compute buffer size = 2754.50 MiB\n", + "llama_new_context_with_model: CUDA_Host compute buffer size = 6.51 MiB\n", + "llama_new_context_with_model: graph nodes = 1050\n", + "llama_new_context_with_model: graph splits = 342\n", + "AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | \n", + "Model metadata: {'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '3', 'tokenizer.ggml.bos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'tokenizer.ggml.add_space_prefix': 'false', 'tokenizer.ggml.add_eos_token': 'false', 'gemma2.final_logit_softcapping': '30.000000', 'gemma2.attn_logit_softcapping': '50.000000', 'general.architecture': 'gemma2', 'gemma2.context_length': '8192', 'gemma2.attention.head_count_kv': '4', 'gemma2.attention.layer_norm_rms_epsilon': '0.000001', 'general.type': 'model', 'tokenizer.ggml.eos_token_id': '1', 'gemma2.embedding_length': '2304', 'tokenizer.ggml.pre': 'default', 'general.name': 'ff8948d2ca54b23c93d253533c6effcf2e892347', 'gemma2.block_count': '26', 'gemma2.feed_forward_length': '9216', 'gemma2.attention.key_length': '256', 'gemma2.attention.head_count': '8', 'gemma2.attention.sliding_window': '4096', 'gemma2.attention.value_length': '256', 'general.file_type': '0'}\n", + "Using fallback chat format: llama-2\n" + ] + } + ], + "source": [ + "import guidance\n", + "import numpy as np\n", + "from guidance import models, gen, block, optional, select, zero_or_more\n", + "from guidance import commit_point\n", + "\n", + "# Load the model\n", + "model_path = \"2b_pt_v2.gguf\"\n", + "gemma2 = models.LlamaCpp(model_path)\n", + "\n", + "\n", + "# Custom generation function to repeat the content up to two. Similar to\n", + "# one_or_more but there is a max value here.\n", + "@guidance(stateless=True)\n", + "def repeat_range(lm, content, min_count=1, max_count=2):\n", + " for _ in range(min_count):\n", + " lm += content\n", + " if max_count == np.inf:\n", + " lm += zero_or_more(content)\n", + " else:\n", + " for _ in range(max_count - min_count):\n", + " lm += optional(content)\n", + " return lm\n", + "\n", + "# Function to generate numbers up to two digits.\n", + "@guidance(stateless=True)\n", + "def number(lm):\n", + " n = repeat_range(select(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']))\n", + " # Allow for negative or positive numbers\n", + " return lm + select(['-' + n, n])\n", + "\n", + "# Function to select player position.\n", + "@guidance(stateless=True)\n", + "def position(lm):\n", + " return lm + select([\"Striker\", \"Midfielder\", \"Defender\", \"Goalkeeper\"])\n", + "\n", + "# Function to select whether the player has won a World cup.\n", + "@guidance(stateless=True)\n", + "def world_cup(lm):\n", + " return lm + select([\"Yes\", \"No\"])\n", + "\n", + "# Regex function for string.\n", + "@guidance(stateless=True)\n", + "def string_exp(lm):\n", + " return lm + gen(regex='([^\\\\\\\\]*|\\\\\\\\[\\\\\\\\bfnrt\\/]|\\\\\\\\u[0-7a-z])*')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1YL3DdFi762J" + }, + "source": [ + "For this example, you will implement a combination of interleaved generative structure and CFG to show the stats of football(soccer) players as JSON.\n", + "\n", + "Here, you will keep the structure and keys of the JSON static, allowing the language model to fill in the value parts. This maintains the overall structure of the output." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "f39A6-Aa73FE" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
Using JSON, describe these Football players:\n",
+              "Lionel Messi\n",
+              "{\n",
+              ""name": "Lionel Messi",\n",
+              ""country": "Argentina",\n",
+              ""position": Striker,\n",
+              ""stats": {\n",
+              "         "goals":10,\n",
+              "    "assists": 10,\n",
+              "    "height": 1.75,\n",
+              "    "world-cup": Yes,\n",
+              "}}
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# `commit_point`s are just ways of stopping functions once you hit a point.\n", + "# For eg: commit_point(\",\") stops string_exp() once you hit `,`.\n", + "@guidance(stateless=True)\n", + "def simple_json(lm):\n", + " lm += ('{\\n' +\n", + " '\"name\": ' + string_exp() + commit_point(',') + '\\n'\n", + " '\"country\": ' + string_exp() + commit_point(',') + '\\n'\n", + " '\"position\": ' + position() + commit_point(',') + '\\n'\n", + " '\"stats\": {\\n' +\n", + " ' \"goals\":'+ number() + commit_point(',') + '\\n'\n", + " ' \"assists\": ' + number() + commit_point(',') + '\\n'\n", + " ' \"height\": ' + number() +'.' + number() + commit_point(',') + '\\n'\n", + " ' \"world-cup\": ' + world_cup() + commit_point(',') + '\\n'\n", + " + commit_point('}')\n", + " + commit_point('}'))\n", + " return lm\n", + "\n", + "# Initialize the query.\n", + "lm = gemma2 + \"\"\"Using JSON, describe these Football players:\n", + "Lionel Messi\n", + "\"\"\"\n", + "\n", + "# Call the simple_json function and implement the JSON structure.\n", + "lm += simple_json()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zi8O6AF4As4N" + }, + "source": [ + "Congratulations! You've successfully implemented constrained generation with the Gemma 2 model using `llama.cpp` and `Guidance` in a Colab environment. You can now experiment with the model, update the grammar, and explore its capabilities." + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "name": "Constrained_generation_with_Gemma.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/README.md b/README.md index 4764815..3c317bc 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,8 @@ You can find the Gemma models on GitHub, Hugging Face models, Kaggle, Google Clo | [Using_Gemma_with_LlamaCpp.ipynb](Gemma/Using_Gemma_with_LlamaCpp.ipynb) | Run Gemma models using [LlamaCpp](https://github.com/abetlen/llama-cpp-python/). | | [Using_Gemma_with_LocalGemma.ipynb](Gemma/Using_Gemma_with_LocalGemma.ipynb) | Run Gemma models using [Local Gemma](https://github.com/huggingface/local-gemma/). | | [Using_Gemini_and_Gemma_with_RouteLLM.ipynb](Gemma/Using_Gemini_and_Gemma_with_RouteLLM.ipynb) | Route Gemma and Gemini models using [RouteLLM](https://github.com/lm-sys/RouteLLM/). | -| [Using_Gemma_with_SGLang.ipynb](Gemma/Using_Gemma_with_SGLang.ipynb) | Run Gemma models using [SGLang](https://github.com/sgl-project/sglang/). | +| [Using_Gemma_with_SGLang.ipynb](Gemma/Using_Gemma_with_SGLang.ipynb) | Run Gemma models using [SGLang](https://github.com/sgl-project/sglang/). | +| [Constrained_generation_with_Gemma.ipynb](Gemma/Constrained_generation_with_Gemma.ipynb) | Constrained generation with Gemma models using [LlamaCpp](https://github.com/abetlen/llama-cpp-python/) and [Guidance](https://github.com/guidance-ai/guidance/tree/main/). | | [Integrate_with_Mesop.ipynb](Gemma/Integrate_with_Mesop.ipynb) | Integrate Gemma with [Google Mesop](https://google.github.io/mesop/). | | [Integrate_with_OneTwo.ipynb](Gemma/Integrate_with_OneTwo.ipynb) | Integrate Gemma with [Google OneTwo](https://github.com/google-deepmind/onetwo). | | [Deploy_with_vLLM.ipynb](Gemma/Deploy_with_vLLM.ipynb) | Deploy a Gemma model using [vLLM](https://github.com/vllm-project/vllm). |