DLP file scanning for files larger than 0.5 MB

Kv2

I want to perform Cloud DLP file scanning for files larger than 0.5 MB without uploading them to Cloud Storage, as per my use case. Is there any way to increase the quota limit using the DLP service client in a code snippet? Thanks


I think you've already seen this page:

https://cloud.google.com/dlp/limits

The docs say:
> If you need to inspect files that are larger than these limits, store those files on Cloud Storage and run an inspection job.

I went to the quotas page on my project and couldn't see an obvious override. Since I work for Google, I searched some of our feature requests and found one that we identify as "155516840", where another customer asked to increase the default limit. That has NOT yet been resolved and remains open. I have added a link to this conversation to that request.

Content methods have a payload limit so that they can operate on the data and respond synchronously without persisting anything. For images, larger files are supported.
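
For reference, here is a rough sketch of what a synchronous content call looks like with the Go client. The project ID, info type, and sample text are just placeholders, and the import paths assume a recent cloud.google.com/go/dlp version, so treat it as an illustration rather than an official sample:

package main

import (
    "context"
    "fmt"
    "log"

    dlp "cloud.google.com/go/dlp/apiv2"
    "cloud.google.com/go/dlp/apiv2/dlppb"
)

func main() {
    ctx := context.Background()

    client, err := dlp.NewClient(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // content.inspect operates on the request payload itself and returns
    // findings synchronously; nothing is persisted, which is why the
    // payload size is capped.
    resp, err := client.InspectContent(ctx, &dlppb.InspectContentRequest{
        Parent: "projects/my-project/locations/global", // placeholder project
        InspectConfig: &dlppb.InspectConfig{
            InfoTypes: []*dlppb.InfoType{{Name: "EMAIL_ADDRESS"}},
        },
        Item: &dlppb.ContentItem{
            DataItem: &dlppb.ContentItem_Value{Value: "Contact me at jane@example.com"},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    for _, f := range resp.Result.Findings {
        fmt.Println(f.InfoType.Name, f.Likelihood)
    }
}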

What types of files are you looking to process?

@ste_google I have all sorts of files to scan, including docx, PDFs, text, images, and other extensions as well. Sometimes files contain only text content, like a .txt file whose size goes beyond 4 to 5 MB, and the expectation is to get a synchronous response.

Currently the supported way to scan those larger files would be to leverage a storage bucket (even temporarily) and our deep inspection jobs. Here is an example pattern to do this in an event driven way: https://cloud.google.com/dlp/docs/automating-classification-of-data-uploaded-to-cloud-storage

Using this method you could easily have it delete the files once scanning is done. 
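
If it helps, here is a rough Go sketch of that pattern: stage the large file in a (temporary) bucket, then kick off a storage inspection job over just that object. The bucket, object, project, and info types below are placeholders:

package main

import (
    "context"
    "fmt"

    dlp "cloud.google.com/go/dlp/apiv2"
    "cloud.google.com/go/dlp/apiv2/dlppb"
)

// inspectUploadedFile starts a deep inspection job over a single object that
// was staged in a temporary bucket.
func inspectUploadedFile(ctx context.Context, client *dlp.Client) error {
    inspectJob := &dlppb.InspectJobConfig{
        StorageConfig: &dlppb.StorageConfig{
            Type: &dlppb.StorageConfig_CloudStorageOptions{
                CloudStorageOptions: &dlppb.CloudStorageOptions{
                    FileSet: &dlppb.CloudStorageOptions_FileSet{
                        Url: "gs://my-temp-scan-bucket/uploads/report.docx", // placeholder
                    },
                },
            },
        },
        InspectConfig: &dlppb.InspectConfig{
            InfoTypes: []*dlppb.InfoType{{Name: "EMAIL_ADDRESS"}, {Name: "PHONE_NUMBER"}},
        },
    }

    job, err := client.CreateDlpJob(ctx, &dlppb.CreateDlpJobRequest{
        Parent: "projects/my-project/locations/global", // placeholder
        Job:    &dlppb.CreateDlpJobRequest_InspectJob{InspectJob: inspectJob},
    })
    if err != nil {
        return err
    }
    fmt.Println("started inspection job:", job.GetName())
    // Once the job finishes (poll GetDlpJob, or add a Pub/Sub action to the
    // job), the temporary object can be deleted from the bucket.
    return nil
}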

Would this work?

The response time for these inspection jobs is about a minute or more. The use case I have is to do real-time scanning, if not for files larger than 5 MB then at least 4 to 5 MB, instead of the 0.5 MB that DLP currently supports.

We don't currently have an option for that. Do you work with anyone on the account side who could help you file a feature request, so we can get more details?

I was just exploring this service and other clouds to see whether it fits my use case or not. Thanks for the responses.

Hi, sorry to jump in on this question. The inspection jobs that we create to scan the file(s) stored in GCS buckets scan all the files every time, and once a day, correct? Is there a way we can configure the inspection job to run on, say, an hourly basis and scan only the new files that have arrived in the bucket rather than scanning everything again?

Inspection jobs are not limited to scanning "all the files every time." You can control exactly what is scanned:

  • You can configure Cloud Storage Inspection jobs to run on create (on demand).
  • You can also specify a timespan to only include files created or modified within a certain window or after a certain time. You could set this to only scan data since the last scan (see the sketch below).
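
A rough Go sketch of the timespan option, using the same dlppb package as the samples later in this thread; the bucket URL is a placeholder:

package main

import (
    "time"

    "cloud.google.com/go/dlp/apiv2/dlppb"
    "google.golang.org/protobuf/types/known/timestamppb"
)

// storageConfigSince builds a StorageConfig that only includes objects
// created or modified after the given time.
func storageConfigSince(since time.Time) *dlppb.StorageConfig {
    return &dlppb.StorageConfig{
        Type: &dlppb.StorageConfig_CloudStorageOptions{
            CloudStorageOptions: &dlppb.CloudStorageOptions{
                FileSet: &dlppb.CloudStorageOptions_FileSet{Url: "gs://my-bucket/**"}, // placeholder
            },
        },
        TimespanConfig: &dlppb.StorageConfig_TimespanConfig{
            StartTime: timestamppb.New(since),
            // With a job trigger, setting EnableAutoPopulationOfTimespanConfig
            // to true lets the service fill this window from the last run.
        },
    }
}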

So you should be able to do everything that you asked here. 

We also provide JobTriggers that will do most of this for you on an automatic schedule (scanning only new data). However, right now those have a minimum interval of 1 day.
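
A rough Go sketch of such a trigger with the minimum (daily) schedule; the client, inspect job config, display name, and project are placeholders or assumed to exist in your code:

package main

import (
    "context"
    "time"

    dlp "cloud.google.com/go/dlp/apiv2"
    "cloud.google.com/go/dlp/apiv2/dlppb"
    "google.golang.org/protobuf/types/known/durationpb"
)

// createDailyTrigger re-runs the given inspect job every 24 hours, which is
// currently the minimum recurrence period.
func createDailyTrigger(ctx context.Context, client *dlp.Client, inspectJob *dlppb.InspectJobConfig) (*dlppb.JobTrigger, error) {
    return client.CreateJobTrigger(ctx, &dlppb.CreateJobTriggerRequest{
        Parent: "projects/my-project/locations/global", // placeholder
        JobTrigger: &dlppb.JobTrigger{
            DisplayName: "daily-gcs-scan", // placeholder
            Job:         &dlppb.JobTrigger_InspectJob{InspectJob: inspectJob},
            Triggers: []*dlppb.JobTrigger_Trigger{{
                Trigger: &dlppb.JobTrigger_Trigger_Schedule{
                    Schedule: &dlppb.Schedule{
                        Option: &dlppb.Schedule_RecurrencePeriodDuration{
                            RecurrencePeriodDuration: durationpb.New(24 * time.Hour),
                        },
                    },
                },
            }},
            Status: dlppb.JobTrigger_HEALTHY,
        },
    })
}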

Hope this helps. 

Apologies for the late reply. Thanks a lot @ste_google, it worked and we have applied the solution. The next step for us is to redact the GCS bucket file that has been identified as sensitive. I did go through a lot of documents online about redacting text and images, but is there a way of redacting an already uploaded GCS file (similar to how we provide a GCS location for the inspection job), so that redaction takes place in the GCS bucket?

Currently the de-identification action is connected to an inspection job.  So you can apply it to an existing file, folder, or set of files just as you can an inspection job.  You specify the source/input in the same way you do for an inspection job but then just turn on the "Make a de-identified copy" action:

[Screenshot: the "Make a de-identified copy" action in the inspection job configuration]

re: "so that redaction takes place in the GCS bucket?"

Currently the de-identification/redaction action requires that the output files be written to another bucket; it will replicate the file/folder structure appropriately so that source files are not overwritten during a scan. Will this work for you?

One caveat is that not all file formats are supported for redaction today.  You can get more details here: https://cloud.google.com/dlp/docs/supported-file-types

Hope this helps. 

Thanks @ste_google, that was helpful. However, if I am passing all configuration to an inspection job via an API call, is there something I can set in the inspection job configuration request to enable the "Make a de-identified copy" action within the request? ... Updated -> I found the answer here -> https://cloud.google.com/dlp/docs/deidentify-storage . Thank you.

Yes, you can enable this via the API directly by adding the deidentify action and its details.

Snippet in Java

  Action.Deidentify deidentify =
          Action.Deidentify.newBuilder()
              .setCloudStorageOutput(outputDirectory)
              .setTransformationConfig(transformationConfig)
              .setTransformationDetailsStorageConfig(transformationDetailsStorageConfig)
              .addAllFileTypesToTransform(fileTypesToTransform)
              .build();

Snippet in Python

    actions = [
        {
            "deidentify": {
                "cloud_storage_output": f"gs://{output_gcs_bucket}",
                "transformation_config": transformation_config,
                "transformation_details_storage_config": {
                    "table": big_query_table
                },
                "file_types_to_transform": ["IMAGE", "CSV", "TEXT_FILE"],
            }
        }
    ]

Here are full code samples in several languages: https://cloud.google.com/dlp/docs/deidentify-storage#dlp_deidentify_cloud_storage-java

Is that what you were looking for?

Yes, that's the one, thanks a lot @ste_google. Our APIs are in Go, so this should help me -> https://cloud.google.com/dlp/docs/deidentify-storage .

That's great.  Go should be on there too. 

Go Snippet

deidentify := &dlppb.Action_Deidentify{
    TransformationConfig:               transformationConfig,
    TransformationDetailsStorageConfig: transformationDetailsStorageConfig,
    Output: &dlppb.Action_Deidentify_CloudStorageOutput{
        CloudStorageOutput: outputDirectory,
    },
    FileTypesToTransform: fileTypesToTransform,
}

action := &dlppb.Action{
    Action: &dlppb.Action_Deidentify_{
        Deidentify: deidentify,
    },
}

// Configure the inspection job we want the service to perform.
inspectJobConfig := &dlppb.InspectJobConfig{
    StorageConfig: storageConfig,
    InspectConfig: inspectConfig,
    Actions: []*dlppb.Action{
        action,
    },
}
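
For completeness, a rough sketch of submitting that inspectJobConfig; the client, ctx, fmt/log imports, and the project/location parent are placeholders assumed to exist in the surrounding code:

// Submit the inspection + de-identification job to the service.
jobReq := &dlppb.CreateDlpJobRequest{
    Parent: "projects/my-project/locations/global", // placeholder
    Job: &dlppb.CreateDlpJobRequest_InspectJob{
        InspectJob: inspectJobConfig,
    },
}

job, err := client.CreateDlpJob(ctx, jobReq)
if err != nil {
    log.Fatal(err)
}
fmt.Println("created job:", job.GetName())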