Correct JSONL format for Vertex AI training.

I tried to upload data in JSONL format to Vertex AI, based on the example given. It still gives an error: "Invalid data. Missing one or more required fields."
What could be the issue here?

Data format pushed:
{"input_text": "Please provide fresh website content for upto <5-page> website. Each page should not exceed more than 200 words of content.", "output_text": "Page 1 Home - Title (30 to 50 Characters) RCA & Asociados, Professional Services, Chula Vista, California. \n Description (280 to 300 Characters) RCA is a regional Foodservice Disposables and packaging broker servicing company in Mexico and Southern California. Our sales representatives are all bilingual and travel in and out of Mexico and the US. We are associated with both large and small independent distributors, chain accounts, and C-Stores. \n Title Foodservice and Packaging Broker with 10+ Years of Experience in CA \n Hero Text We Focus On Manufacturers Looking To Increase Sales & Help Them Grow Customers In CA & Mexico! Contact Us (Link to Contact Page Button) \n (H1) (20 to 70 Characters) RCA – Profile, Mission & Vision RCA is a regional foodservice disposable and packaging broker servicing company in Mexico and Southern California. We have been in business since 2006 and are associated with both large and small independent distributors, chain accounts, and C-Stores. \n We have offices, both in Mexico and the US, and our sales representatives are bilingual and weekly travel in and out of Mexico and the US. \n (H2) Our Strategic Mission. Our goal is to represent and focus only on a handful of manufacturers who are looking to expand sales in Mexico and Southern California. We help them to grow their market share and increase their selective list of valued customers in Mexico, who are strategic and smart. 
\n (H2) Our Business Vision \n We establish long term relationships with our suppliers and distributors, together, building a successful partnership benefiting our mutual interests.\n Page 2 Services - Title (30 to 50 Characters) RCA & Asociados, Services, Chula Vista, California \n Description (280 to 300 Characters) RCA is an established, professional, and reputable company having good relationships with a wide range of distributors in Mexico and the US. We cater to the Hispanic and Asian Markets in the US and are exploring the opportunity to represent and work with your company to help you improve your market range. Contact us. \n (H1) (20 to 70 Characters) Well Organized, Professional & Reputed Organization \n RCA is an established, professional, and reputable company having good relationships with a wide range of distributors in Mexico and the US. We cater to the Hispanic and Asian Markets in the US and are exploring the opportunity to represent and work with your company to help you improve your market range. \n (H2) Your Local Salesforce \n We provide our clients who are manufacturers and who have never been to Mexico with transparency and authenticity. We bring in experts to help create relationships with their consumers and grow their client base. \n (H2) Our Valued Relationships \n We have established relationships with many markets in Mexico and can help you connect with foodservice distributors, retail, grocery, and chain accounts, depending on your needs. \n View our Clients (Links to Our Clients) \n Page 3 Our Clients \n Title (30 to 50 Characters) RCA & Asociados, Our Clients, Chula Vista, California \n Description (280 to 300 Characters) RCA & Asociados has been active in foodservice disposables and packaging brokerage since 2006 and is headquartered in Chula Vista, CA. Our primary focus is to help companies grow market share and, with a joint effort, successfully penetrate the market. Know our customers. 
\n (H1) (20 to 70 Characters) RCA Clientele \n Page 4 Contact Us \n Title (30 to 50 Characters) RCA & Asociados , Contact Us , Chula Vista, California \n Description (280 to 300 Characters) RCA & Asociados believes in its vision of becoming the premier brokerage company in Mexico & the US. We offer experienced and professional sales representation. We specialize in selling and promoting US and Mexican products within the foodservice and packaging industry. \n (H1) (20 to 70 Characters) Get in Touch With Us."}


Roderick
Community Manager

Hi there @Srushti_Zope,

The error message "Invalid data. Missing one or more required fields" means that the JSONL data you are trying to upload is missing some of the required fields. The required fields for a Vertex AI dataset are:

  • input_text: The text that you want to be processed by the model.
  • output_text: The predicted output text.

In your JSONL data, the output_text field is missing. You can fix this by adding the output_text field to your JSONL data, and then re-uploading the data to Vertex AI.

Here is an example of a valid JSONL data record:

{"input_text": "This is the input text.", "output_text": "This is the output text."}

Once you have added the output_text field to your JSONL data, you can re-upload the data to Vertex AI. The error message should disappear, and the data should be processed successfully.

Here are some additional troubleshooting steps you can take:

  • Make sure that the JSONL data is formatted correctly.
  • Check the Vertex AI documentation for more information on the required fields.
  • Contact Google Cloud support for assistance.
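To check the first step locally before uploading, here is a minimal sketch in Python. It assumes the two required field names are the ones quoted in this thread (input_text and output_text); the function name find_problems is made up for illustration and is not part of any Vertex AI library.

```python
import json

# Field names as quoted in this thread; adjust if your schema differs.
REQUIRED_FIELDS = ("input_text", "output_text")


def find_problems(lines):
    """Return (line_number, message) pairs for records that look invalid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # A record spread over several physical lines ends up here,
            # because JSONL expects one complete object per line.
            problems.append((number, f"not valid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems
```

To use it, pass the lines of your file, for example find_problems(open("train.jsonl", encoding="utf-8")). An empty list means every line parsed and contained both fields. It does not prove that Vertex AI will accept the file.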

Hope this helps!

 

Hi @Roderick,
1. The output text is present in the second line of the above example.

2. I also tried uploading the same example shown on the Vertex AI portal, but it was not accepted either and gave the same error.

{"input_text": "Create a description for Plantation Palms", "output_text": "Enjoy some fun in the sun at Gulf Shores."}

3. In the meantime, please share the documentation link for the correct JSONL format.

Roderick
Community Manager

Sure, here are some links that you can refer to for the correct Vertex AI JSONL format:

The JSONL format for Vertex AI is a simple text format that is easy to read and write. Each line of the JSONL file represents a single prediction instance. The format of the prediction instance depends on the model that you are using.

Here is an example of a JSONL file that can be used for batch predictions with Vertex AI:

{"text": "This is a text prediction instance."}
{"text": "This is another text prediction instance."}

The text field is the only required field in the JSONL file. However, you may also include other fields in the file, depending on the model that you are using.
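One way to produce a file in this shape is to let a JSON library write each instance on its own line. The sketch below uses only the Python standard library; the helper name to_jsonl is hypothetical, and the "text" field is simply the one from the example above, not a confirmed requirement of any particular model.

```python
import json


def to_jsonl(records):
    """Serialize dicts as JSON Lines: one compact JSON object per line."""
    # json.dumps never emits raw newlines, so each record stays on one line.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


instances = [
    {"text": "This is a text prediction instance."},
    {"text": "This is another text prediction instance."},
]
payload = to_jsonl(instances)
```

Here payload is a string that can be written to a file as-is. Any extra fields a model needs can be added to each dict before serializing.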

Thank you Roderick! Some basic questions, please. Is it possible to fine-tune a model on a particular knowledge base, say "laws on defamation", and then use the model to answer questions on that topic from that data? I assume yes, but I wonder how exactly the "input text" and "output text" would work for it.

Is there a method whereby we could simply upload PDFs to train the model? Lazy me..lol

Thank you

Hi, I ran into the same problem and get the invalid data error. Using your example doesn't work either. Could you please help? Thanks!

I am facing the same issue. I have noticed that the behavior is non-deterministic. I was able to fine-tune a model with a sample JSONL file containing "input_text" and "output_text". After a while, the same JSONL file stopped being accepted. As evidence, I have a working test-tuned model based on the same JSONL file.

Same problem here. It looks like a bug on GCP?

The error message you received indicates that the data you provided is missing one or more required fields. To diagnose the issue, let's compare the data you provided with the expected format for JSONL files in Vertex AI.

In Vertex AI, when uploading data in JSONL format, each line in the file should contain a valid JSON object. Looking at the data you provided, it seems that you have a single JSON object, so it should be valid.

However, there might be other reasons causing the "Invalid data" error. Here are a few possibilities to consider:

1. Ensure that each line in your JSONL file contains a valid JSON object. It's possible that there might be an issue with line breaks or formatting. Make sure that each line ends with a newline character ("\n") and that there are no additional spaces or characters at the beginning or end of each line.

2. Check if there are any special characters or escape sequences within the text fields. Certain characters, such as quotation marks or backslashes, might need to be properly escaped within the JSON object.

3. Verify that there are no missing or misspelled required field names. The error message suggests that one or more required fields are missing. Check the documentation or example provided by Vertex AI to ensure that you're including all the required fields and that their names match exactly.

If you've reviewed these possibilities and still encounter the error, please provide the exact error message you received, as it might contain more specific information about the issue.
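Points 1 and 3 can be checked mechanically. The following is a rough sketch, not an official tool: it flags stray whitespace around a line and suggests a likely intended name when a key almost matches an expected field. The expected names are the ones mentioned in this thread, and both helper names are made up for illustration.

```python
import difflib
import json

# Field names as used in this thread; treat them as an assumption.
EXPECTED = ["input_text", "output_text"]


def has_stray_whitespace(line):
    """True if whitespace surrounds the JSON object (ignoring the final newline)."""
    body = line.rstrip("\n")
    return body != body.strip()


def field_name_hints(line):
    """Map near-miss keys (e.g. 'input text') to the expected field name."""
    record = json.loads(line)
    hints = {}
    for key in record:
        if key in EXPECTED:
            continue
        # Compare case-insensitively; the 0.8 cutoff is an arbitrary choice.
        close = difflib.get_close_matches(key.lower(), EXPECTED, n=1, cutoff=0.8)
        if close:
            hints[key] = close[0]
    return hints
```

For point 2, writing records with json.dumps instead of by hand takes care of escaping quotation marks, backslashes, and newlines inside text fields.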

I get this error and it's driving me nuts: invalid JSON in google.cloud.discoveryengine.v1main.Document, near 1:9 (offset 8): no such field: 'test'

I noticed a discrepancy in the format requirements. The data schema specifies a ".json" file extension. However, the data format follows the JSON Lines (JSONL) standard, which is typically associated with a ".jsonl" or ".ndjson" extension. Files with those extensions will not be accepted. So the way to go is to use the JSONL structure but save the file with a .json extension. If you try to open the file you'll get a parsing error, but this is apparently what's needed, as it worked fine for me after numerous tests.

Do we have any sample test data that shows exactly how the JSON Lines file should look? I tried all the operations suggested above and it is still failing.

Here is an example of JSON Lines from its official site:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

So basically you need this data structure, but the extension should be .json, not .jsonl.
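As a sketch of that workaround in Python: write one JSON object per line, but to a file whose name ends in .json. The function name write_jsonl_as_json and the file name train.json are placeholders, and whether the uploader really requires this extension is based only on the reports in this thread.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_as_json(records, path):
    """Write JSON Lines content (one object per line) to a .json file."""
    path = Path(path)
    if path.suffix != ".json":
        raise ValueError("expected a file name ending in .json")
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps characters such as card suits readable.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


records = [
    {"name": "May", "wins": []},
    {"name": "Deloise", "wins": [["three of a kind", "5♣"]]},
]

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out = write_jsonl_as_json(records, Path(tmp) / "train.json")
    lines = out.read_text(encoding="utf-8").splitlines()
```

After this runs, lines holds one JSON string per record. A regular JSON parser will reject the file as a whole, which matches the parsing error mentioned above.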