Accessing large files in Cloud Storage from Cloud Run

Hi,
 
I would like to use Cloud Run to do video processing. What is the fastest way for Cloud Run to access data stored in Cloud Storage?
 
I tried downloading the 4 chunks that make up a GoPro video. Each chunk is roughly 4 GB. Downloading the 4 chunks locally takes 5 minutes using the blob.download_to_filename() method.
 
Is there a faster way to access data?

Hi @magenti,

Welcome to Google Cloud Community!

Can you check whether this similar Cloud Community question addresses your concern?

I think it might help if we contemplate the nature of your application as a whole.   I think I am hearing that you have 16GB video files and you want to process them.  These files are starting their life as objects in Google Cloud Storage and exist as multiple 4GB chunks.  Is this the design you have chosen or is this dictated by some upstream technical reasons?

Next I am hearing that you want to process this video data.  You are thinking of using Cloud Run to process the data.  I am sensing that you want to send a request to Cloud Run identifying the video files that you want to process, you want Cloud Run to "do its work" and then, presumably, store its results somewhere (maybe Cloud Storage).  Is this correct?

Now comes the heart of the story.  How long do you anticipate Cloud Run to take to perform its work?  How many requests per unit of time do you anticipate it processing?  Do you have any requirements for your solution ... for example ... do you need to "process the data in less than 10 minutes/5 minutes/3 minutes"?  Do you have any other requirements ... for example cost ... (I'm making this up) ... if I said it cost X to process an image in 10 minutes, would you be willing to pay 2X to have it processed in 5 minutes?

Now let's get back to your technical question ... assuming you are processing the data and the data exists in multiple objects in Cloud Storage, how are you processing the data?  Is the data in any kind of format where it can be stream processed?  For example, can we process the data as chunks of it arrive instead of waiting for the full download to a Docker container to complete before we start processing?

The underlying API provided by Google Cloud Storage is the Objects: get method.  This API takes the identity of the object you want as input and starts to stream the content of that object as a response.  Wrapper libraries can then write it to a file or present it to your application as portions of it arrive.
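For illustration, here is a minimal Python sketch of that streaming pattern, assuming the google-cloud-storage client library (Blob.open is available in recent versions); the bucket and object names and the process callback are placeholders, not part of your code:

from google.cloud import storage

def stream_object(bucket_name, object_name, process, read_size=16 * 1024 * 1024):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    # open("rb") returns a file-like reader that fetches ranges of the object
    # on demand, so processing can start before the whole 4 GB chunk is local.
    with blob.open("rb") as reader:
        while True:
            data = reader.read(read_size)
            if not data:
                break
            process(data)  # caller-supplied processing step (placeholder)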

So ... when you ask "is there a faster way to access the data" ... what I am sensing is that you are downloading the GCS object, saving it to a file in your Docker container and then reading the file.  To understand if we can do this faster, we must first ask ... what are the constraints today?  Is it network latency between your Compute and Cloud Storage, is it paging in your OS, is it waiting for the whole file to be downloaded before we start processing it ... etc.

Hello Kolban,
first of all, thank you for your willingness to help.
Your intuitions are all correct. Let me recap them one by one:
Yes, the videos my pipeline needs to process are divided into chunks of 4 GB each, because GoPro splits long videos this way.
Yes, I use Cloud Run to process the (chunks of) videos being uploaded. In other words, I have an app in Dart/Flutter in which, at the push of a button, I send an HTTPS request to trigger the Cloud Run service I need. The service does its job and saves the results in Cloud Storage. In this case the size of the generated results is not an issue, because the results file is small and the transfer is fast.
I do not have special requirements. I do not have real-time constraints. And no, I would not pay 2x the cost to get the results in half the time. At the same time, efficiency is always a requirement, which is why I am asking whether the way I am accessing the video chunks in Cloud Storage is the best one or whether there are better ways to do the same at the same price.
At the moment the data are processed only once the whole upload has completed. So if a user uploads (say) a video made of 3 chunks, the Cloud Run service gets triggered only when all 3 of them are in Cloud Storage. This is because I need the results to keep the same order as the input chunks. So at the moment, in Flutter I have a synchronous instruction to upload the multiple chunks; once it completes, the HTTPS request is sent with the chunk links as an argument. This is how the code is implemented at the moment, but I guess I could easily change it to process each chunk as it arrives in Cloud Storage if needed.
Yes, I am downloading the target file (chunk) in the Cloud Run container through the method:

storage_client.get_bucket(bucket).blob(object_name).download_to_filename(local_path)
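In context, the service does roughly the following for each video (a simplified sketch; handle_video and the process_chunk callback are placeholder names for my actual code, and error handling is omitted):

import os
from google.cloud import storage

storage_client = storage.Client()

def handle_video(bucket, chunk_names, process_chunk):
    # chunk_names arrive in upload order, so looping over them keeps the
    # results aligned with the order of the original GoPro chunks.
    for object_name in chunk_names:
        local_path = os.path.join("/tmp", os.path.basename(object_name))
        storage_client.get_bucket(bucket).blob(object_name).download_to_filename(local_path)
        process_chunk(local_path)  # caller-supplied video processing step (placeholder)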


Since I am not a cloud expert, I am wondering if this is the best approach or if there are alternative, faster ways to feed the Cloud Run service.
Well, this is all; I hope I have not skipped any of your questions.
Let me thank you again for your kind help
Best regards
Mauro

Howdy Mauro.  I heard you say that there is no need for real-time response.  To my eyes, the API you are using (download_to_filename) is likely going to be either "just fine" or maybe even "the best there is".  If I were sitting in your seat, I'd suggest measuring, inside your application, how long it takes to download the data you want to process from GCS.  For example, note the time at which you request the files to be downloaded and the time at which that activity completes.  You now have a duration.  Divide the size of the files by the duration and you have a rate (bandwidth).  That will be an interesting metric to discuss.  Does the metric "feel" correct?  Is it close to the advertised egress bandwidth of Cloud Storage?  If the answer is yes, you are close to optimal (we can't get GCS data faster than its advertised egress bandwidth).  However, if the answer is not great ... then we have to ask ... where is the bottleneck?  For example, if you can only write XXX MBytes/second to a file in your container and the data arrives at YYY MBytes/second, you will be constrained by whichever is lower.  As such, any issue may not be GCS related at all.
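For example, something along these lines (a minimal sketch using the same download_to_filename call; the bucket/object names and local path are placeholders):

import os
import time
from google.cloud import storage

def timed_download(bucket_name, object_name, local_path):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    start = time.monotonic()
    blob.download_to_filename(local_path)
    duration = time.monotonic() - start
    size_mb = os.path.getsize(local_path) / (1024 * 1024)
    # Compare this rate against the egress bandwidth you expect from
    # Cloud Storage to see whether the download itself is the bottleneck.
    print(f"{object_name}: {size_mb:.0f} MB in {duration:.1f} s -> {size_mb / duration:.1f} MB/s")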

Thank you Kolban, it makes a lot of sense. I will measure the download time, divide the file size by it, and see what I end up with. Thank you for your valuable help!