Vertex AI pipeline - IndexError: Invalid key: 0 is out of bounds for size 0

Hi, 

I am trying to fine-tune the Llama2-7B model using the Vertex AI Model Garden notebook (Google Colab). Please find the details below: 

Model used - Llama2-7B  

Fine-tuning method - PEFT  

Number of samples in Training Set - 100  

Number of samples in Eval Set - 20  

Format of the training data -  

 

{"text": "### Human: What is arithmatic mean? ### Assistant: The arithmetic mean, or simply the mean, is the average of a set of numbers obtained by adding them up and dividing by the total count of numbers."}
{"text": "### Human: What is geometric mean? ### Assistant: The geometric mean is a measure of central tendency calculated by multiplying all values in a dataset and then taking the nth root of the product, where n is the total number of values."}


Vertex AI pipeline parameters:

pipeline_parameters = {
    "base_model": base_model,
    "dataset_name": dataset_name,
    "prediction_accelerator_type": prediction_accelerator_type,
    "training_accelerator_type": training_accelerator_type,
    "training_precision_mode": training_precision_mode,
    "training_lora_rank": 16,
    "training_lora_alpha": 32,
    "training_lora_dropout": 0.05,
    "training_steps": 20,
    "training_warmup_steps": 10,
    "training_learning_rate": 2e-4,
    "evaluation_steps": 10,
    "evaluation_limit": 1,
}
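
For reference, this is roughly how the notebook submits the pipeline with these parameters (a sketch of the flow; the project, display name, and template path below are placeholders, not the actual values from my setup):

from google.cloud import aiplatform

# Placeholder project/location; the actual values come from the notebook setup.
aiplatform.init(project="my-project", location="us-central1")

pipeline_job = aiplatform.PipelineJob(
    display_name="llama2-7b-peft-finetune",  # placeholder display name
    template_path="pipeline.yaml",           # placeholder for the compiled pipeline spec
    parameter_values=pipeline_parameters,    # the dict above
)
pipeline_job.run()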

 

When I execute the training pipeline, I get the error below: 

 

raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")  
IndexError: Invalid key: 0 is out of bounds for size 0

 

Can you please help me with the questions below? 

1. Is the format of the training data correct? I used the format given as the default example in the Colab notebook; you can find the dataset here.

2. Is the number of samples too small?

3. Is there anything I am missing here?

 

Thank you,  

KK


1. Is the format of the training data correct? I used the format given as the default example in the Colab notebook; you can find the dataset here.
I have tried to validate your JSONL file here (https://jsonlines.org/validator/), and it appears to be formatted correctly.
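
If you want to double-check the file locally as well, a quick sketch like this (the filename train.jsonl is a placeholder for your actual dataset path) verifies that every line is valid JSON and carries the expected "text" field:

import json

# Placeholder filename; point this at your actual dataset file.
with open("train.jsonl") as f:
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # raises JSONDecodeError on a malformed line
        assert "text" in record, f"line {i} is missing the 'text' field"

print("All lines parsed and contain a 'text' field.")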

2. Is the number of samples too small?
There are no defined guidelines for the number of samples, but it is generally recommended to have at least 10x more samples than features in your dataset.

3. Is there anything I am missing here?
Looking at the error, it appears the trainer is indexing into a dataset that has ended up empty. From some research, this looks like an ongoing issue with this workflow; below are examples of reported issues with solutions that might be helpful, and a combined sketch of both fixes follows the links.
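
For context, this is the error Hugging Face datasets raises when something indexes into a dataset with zero rows, which you can reproduce independently of Vertex AI:

from datasets import Dataset

# A zero-row dataset reproduces the same IndexError on any index access.
empty = Dataset.from_dict({"text": []})
empty[0]  # IndexError: Invalid key: 0 is out of bounds for size 0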

[1]
"data = data["train"].shuffle().map(generate_and_tokenize_prompt, batched = False) # change this line to -

data["train"] = data["train"].shuffle().map(generate_and_tokenize_prompt, batched = False)
After doing this change you code should run fine."
[2]
"Just add remove_unused_columns=False to TrainingArguments"

[1]https://github.com/huggingface/datasets/issues/5946#issuecomment-1635764512
[2]https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25
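
Putting the two suggestions together, here is a minimal sketch of the pattern, assuming a Hugging Face datasets + transformers workflow (the filename and the generate_and_tokenize_prompt stand-in are hypothetical, not taken from the Vertex AI container):

from datasets import load_dataset
from transformers import TrainingArguments

def generate_and_tokenize_prompt(example):
    # Hypothetical stand-in for the real prompt-formatting/tokenization step.
    return example

data = load_dataset("json", data_files={"train": "train.jsonl"})

# [1] Assign the mapped split back into the DatasetDict instead of
# overwriting `data` itself, so data["train"] stays a valid, non-empty split.
data["train"] = data["train"].shuffle().map(generate_and_tokenize_prompt, batched=False)

# [2] Keep columns the model's forward() does not declare, so the Trainer
# does not drop every column and leave a zero-row dataset behind.
training_args = TrainingArguments(
    output_dir="./out",
    remove_unused_columns=False,
)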

@nceniza Thanks for going through the details and providing the inputs. 

[1] Changing to data["train"] = data["train"].shuffle().map(generate_and_tokenize_prompt, batched=False)

I am using the notebook from [here](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/...

I could not find where exactly to make the suggested change. Could you please let me know if I am missing anything here? 

[2] Add remove_unused_columns=False  

I have added remove_unused_columns here; however, it does not seem to fix the issue. 

flags = {
    "learning_rate": 2e-4,
    "precision_mode": finetune_precision_mode,
    "task": "instruct-lora",
    "per_device_train_batch_size": per_device_train_batch_size,
    "dataset_name": dataset_name,
    "instruct_column_in_dataset": instruct_column_in_dataset,
    "template": template,
    "pretrained_model_id": model_id,
    "output_dir": output_dir,
    "merge_base_and_lora_output_dir": merged_model_output_dir,
    "warmup_steps": 10,
    "max_steps": max_steps,
    "lora_rank": lora_rank,
    "lora_alpha": lora_alpha,
    "lora_dropout": lora_dropout,
    "evaluation_limit": 100,
    "remove_unused_columns": False 
}

train_job = aiplatform.CustomJob(
    display_name=job_name,
    worker_pool_specs=[
        {
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_type": accelerator_type,
                "accelerator_count": accelerator_count,
            },
            "replica_count": replica_count,
            "container_spec": {
                "image_uri": TRAIN_DOCKER_URI,
                "args": ["--{}={}".format(k, v) for k, v in flags.items()],
            },
        }
    ],
    staging_bucket=STAGING_BUCKET,
)
train_job.run()
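
One thing I am unsure about (an open question on my side, not a confirmed diagnosis): the args list passes the boolean as the literal string "--remove_unused_columns=False", and I do not know whether the training container's flag parser actually interprets that string as False rather than treating any non-empty value as truthy.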

 Any help would be appreciated, thank you. 

KK