Mistral-7B-Instruct on Vertex AI endpoint often truncates responses

I'm using the prompt template directly from Mistral:

{
  "instances": [
    {
      "prompt": "<s>[INST] What is your favourite colour and why? [/INST]My favorite color is blue.</s>[INST] And which one after that? [/INST]"
    }
  ],
  "parameters": {
    "max_tokens": -1
  }
}

Then use this command:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://europe-west4-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/europe-west4/endpoints/${ENDPOINT_ID}:predict" \
  -d "@request.json"

And this is a typical response:

{
  "predictions": [
    "Prompt:\n\u003cs\u003e[INST] What is your favourite colour and why? [/INST]My favorite color is blue.\u003c/s\u003e[INST] And which one after that? [/INST]\nOutput:\n My second favorite color is green. This is because it is a color that represents"
  ],
  "deployedModelId": "[REDACTED]",
  "model": "projects/[REDACTED]/locations/europe-west4/models/mistral-7b-instruct-v0_1",
  "modelDisplayName": "mistral-7b-instruct-v0_1",
  "modelVersionId": "1"
}

As you can see, the response is short, which is fine, but it is also cut off mid-sentence. What causes this?

I have tried the following:

  • Different values for max_tokens, such as -1, 500, and 2048, as well as omitting max_tokens entirely.
  • Escaping the prompt in different ways, including escaping the forward slashes.

Am I still doing something wrong? Or does the Vertex AI endpoint mangle my JSON somehow?
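
For reference, the equivalent call via the Vertex AI Python SDK is below, in case the problem is specific to curl. This is only a rough sketch: it assumes the google-cloud-aiplatform package, the project and endpoint IDs are placeholders, and it mirrors the request.json shown above.

# Rough equivalent of the curl call above, using the Vertex AI Python SDK.
# "your-project" and "your-endpoint-id" are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="europe-west4")
endpoint = aiplatform.Endpoint("your-endpoint-id")

response = endpoint.predict(
    instances=[
        {
            "prompt": (
                "<s>[INST] What is your favourite colour and why? [/INST]"
                "My favorite color is blue.</s>"
                "[INST] And which one after that? [/INST]"
            )
        }
    ],
    parameters={"max_tokens": -1},  # same parameters block as in request.json
)
print(response.predictions[0])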

2 REPLIES

As a follow-up: I never fixed this. I switched to deploying Mistral-7B and Mixtral-8x7B with vLLM, using its OpenAI-compatible API, and that resolved the truncation for me. It's obviously just a workaround, though.
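
In case it's useful, this is roughly what that setup looks like. It's a minimal sketch assuming a local vLLM server started with the OpenAI-compatible entrypoint and the openai Python client; the host, port, and model name are placeholders.

# Query a vLLM server through its OpenAI-compatible API.
# Assumes the server was started with something like:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {"role": "user", "content": "What is your favourite colour and why?"},
        {"role": "assistant", "content": "My favorite color is blue."},
        {"role": "user", "content": "And which one after that?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

A nice side effect is that the chat endpoint applies the model's chat template (the [INST] ... [/INST] format) for you, so there's no need to build the prompt string by hand.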

It seems like there is a small mix-up in the structure of your JSON request. For the Mistral model on Vertex AI, you'll want to make sure that all generation parameters (such as max_tokens) are included inside the same dictionary as the prompt within the instances list, rather than in a separate top-level parameters object.

Request

{
  "instances": [
    {
      "prompt": "<s>[INST] What is your favourite colour and why? [/INST]My favorite color is blue.</s>[INST] And which one after that? [/INST]",
      "max_tokens": 500, 
      "stream": false 
    }
  ]
}

Response 

{
 "predictions": [
   "Prompt:\n<s>[INST] What is your favourite colour and why? [/INST]My favorite color is blue.</s>[INST] And which one after that? [/INST]\nOutput:\n I don't have a favorite color besides blue. I don't have the ability to have personal preferences or emotions. However, I can tell you that green is a color that is often associated with me because of the text in my name. I'm an assistant designed to help answer questions and generate text, so the color green is commonly used to represent information and text in digital interfaces."
 ],
 "deployedModelId": "1829198121503031296",
 "model": "projects/579220845622/locations/us-central1/models/mistralai_mistral-7b-instruct-v0_2-colmobo-ai",
 "modelDisplayName": "mistralai_mistral-7b-instruct-v0_2-colmobo-ai",
 "modelVersionId": "1"
}

Note: please don't pass max_tokens as -1; I don't think that value is supported.
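
For completeness, the same corrected request through the Vertex AI Python SDK would look roughly like this. It's a sketch only: the project, region, and endpoint ID are placeholders.

# Corrected request shape: max_tokens sits inside the instance dict,
# next to the prompt, rather than in a top-level "parameters" object.
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")
endpoint = aiplatform.Endpoint("your-endpoint-id")

response = endpoint.predict(
    instances=[
        {
            "prompt": (
                "<s>[INST] What is your favourite colour and why? [/INST]"
                "My favorite color is blue.</s>"
                "[INST] And which one after that? [/INST]"
            ),
            "max_tokens": 500,
            "stream": False,
        }
    ]
)
print(response.predictions[0])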