Multimodal embeddings resized to 512x512 - any logic to resize around the dominant object?

So images sent to the multimodal Embeddings API are resized to 512 x 512 pixels, as stated in the docs:

The maximum image size accepted is 20 MB. To avoid increased network latency, use smaller images. Additionally, the model resizes images to 512 x 512 pixel resolution. Consequently, you don't need to provide higher resolution images. ( https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings )

Is there any logic that uses a detected dominant object in the image when resizing? Or should I be using the Cloud Vision Crop Hints API first, then resizing to 512 x 512 myself while making sure the dominant object stays inside the bounding box (roughly the pipeline sketched at the end of this post)? Also, is it possible to get the resized image/coordinates back from the model to check what it is actually processing? If not, that would be very useful to add in a future release.

I'm guessing the answer will be "No, yes, no" but just checking.
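
In case it helps, this is roughly the pre-processing I'd do if the answer really is "crop it yourself with Crop Hints". It's only a sketch assuming the google-cloud-vision and Pillow client libraries plus the Vertex AI SDK; the project ID, location, and file names are placeholders:

```python
# Rough sketch: get a square crop hint from Cloud Vision, crop around it,
# resize to 512x512, then embed the result with the multimodal embedding model.
# "my-project", "us-central1", and the file paths are placeholders.
from google.cloud import vision
from PIL import Image
import vertexai
from vertexai.vision_models import Image as VertexImage, MultiModalEmbeddingModel

SOURCE_PATH = "product_photo.jpg"       # original image
CROPPED_PATH = "product_photo_512.png"  # pre-processed image to embed

# 1. Ask Cloud Vision for a square crop hint (aspect ratio 1.0 to match 512x512).
vision_client = vision.ImageAnnotatorClient()
with open(SOURCE_PATH, "rb") as f:
    content = f.read()
response = vision_client.annotate_image({
    "image": {"content": content},
    "features": [{"type_": vision.Feature.Type.CROP_HINTS}],
    "image_context": {"crop_hints_params": {"aspect_ratios": [1.0]}},
})
vertices = response.crop_hints_annotation.crop_hints[0].bounding_poly.vertices

# 2. Crop to the hinted bounding box and resize to 512x512 locally.
xs = [v.x for v in vertices]
ys = [v.y for v in vertices]
with Image.open(SOURCE_PATH) as img:
    cropped = img.crop((min(xs), min(ys), max(xs), max(ys)))
    cropped.resize((512, 512), Image.LANCZOS).save(CROPPED_PATH)

# 3. Embed the pre-cropped image with the multimodal embedding model.
vertexai.init(project="my-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
embeddings = model.get_embeddings(image=VertexImage.load_from_file(CROPPED_PATH))
print(len(embeddings.image_embedding))  # 1408-dimensional by default
```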

1 REPLY

Using Vision crop hints is a good suggestion, but I would still recommend filing this as a feature request for Vision AI. As for "is it possible to get the resized image/coordinates from the model to check what it is processing": that appears to be internal processing of the model, but it might be worth including in the feature request as well, at least to get a general idea of what the model is processing.