Re: How to know if BQ query was run from console

joonvena1 · 04-24-2024 12:20 AM

I have log sink where i'm routing jobservice.jobcompleted to Pub/Sub. Idea of this is to check that all the jobs have labels defined. As you can't define labels for jobs that are run from the BQ UI i'm trying to figure out best way to ignore those runs. Only thing i have noticed from the data that could be used is the userAgent information.

Edit:
You can actually define the labels also on the UI side using the system-variable @@query_label eg.:

SET @@query_label = "environment:dev, application:test"; <query>;

But this then creates two jobservice.jobcompleted events where the other one contains the labels but other doesn't so this would trigger missing labels even when they are there and didn't find any sensible info to tie these events together. So i would rather skip everything coming from the UI.

ms4446

Hi @joonvena1 ,

Here's a possible approach that combines both filtering and processing:

1. Pub/Sub Filtering:

Start by setting up a filter on your Pub/Sub subscription to exclude messages that likely originate from the UI. Here's an example you can adapt:

attributes.type="jobservice.jobcompleted" AND NOT attributes.userAgent CONTAINS "google-cloud-sdk" AND NOT attributes.userAgent CONTAINS "BigQuery Java" 

2. Cloud Function:

Use a Cloud Function triggered by the filtered messages to perform more detailed checks:

Extract job ID and user agent.
Use refined regular expressions to identify UI-based jobs.
Check for labels and annotate the message accordingly.

Here's a Python example for the Cloud Function:

import base64
import json
import re

def process_job_completion(event, context):
    message_data = base64.b64decode(event['data']).decode('utf-8')
    message = json.loads(message_data)

    user_agent = message.get('attributes', {}).get('userAgent', '')
    job_id = message.get('attributes', {}).get('jobId', 'Unknown')

    ui_patterns = [r'^Mozilla/', r'^Chrome/'] 
    is_ui_job = any(re.match(pattern, user_agent) for pattern in ui_patterns)

    labels = message.get('attributes', {}).get('labels', {})
    has_labels = bool(labels)

    if not has_labels and not is_ui_job:
        print(f"Alert: Job {job_id} completed without labels and is not UI-originated.")
    elif is_ui_job:
        print(f"Info: Job {job_id} is from UI; label check is skipped.")
    else:
        print(f"OK: Job {job_id} has labels.")

    return { 
        'jobId': job_id, 
        'userAgent': user_agent, 
        'isUiJob': is_ui_job, 
        'hasLabels': has_labels 
    }

Additional Considerations:

Monitoring User Agent Changes: Be vigilant about updates to BigQuery's UI, which may alter userAgent strings, thus necessitating updates to your filter criteria.
Managing False Positives: Regularly refine your pattern matching to maintain accuracy and reduce false positives.

joonvena1

Hi @ms4446 thanks for this. This is something that i already have in place. Was just wondering if there would be some other attribute / way than using the userAgent information to distinct if the query is coming from UI. I currently have the processing done inside Cloud Run but your suggestion 1 makes more sense to just filter it on the Pub/Sub subscription level. Much easier to update this compared to the processing application.

joonvena1

Cant actually use the Pub/Sub filtering because it can only access the message attributes and not the actual data that is in the message. As the event is coming through the logging sink i can't affect the message attributes. Don't want to make another processing step that would insert the userAgent informaton to the message attributes so i'm going to go with the solution 2 and still utilize the Cloud Run.

ms4446

Hi @joonvena1 ,

Given that Pub/Sub filtering isn’t viable because it only accesses message attributes and not the payload where your userAgent information is located, focusing on refining your existing Cloud Run solution is a sound strategy.

Since you’re interested in exploring alternatives to using the userAgent to distinguish UI-originated queries, here are a couple of additional ideas you might consider:

Log Analysis Enhancements:
- Custom Metadata: If modifying the BigQuery UI job submissions directly isn't an option, consider whether any other metadata automatically included in logs could serve as a reliable indicator. For instance, checking for specific project IDs, user account details, or job configurations that are typically associated with UI submissions might help.
- Advanced Parsing Techniques: Implement more sophisticated log parsing within your Cloud Run service to extract and analyze potential identifiers that are less obvious than userAgent. This might include parsing out session or context specifics that are unique to UI sessions.
Integration with IAM Policies:
- User Roles and Activity: Sometimes, the roles assigned to users or the type of activity they are allowed to perform can indicate whether a submission is likely from the UI. You could potentially cross-reference this with IAM logs to filter out jobs initiated via the UI based on user permissions or roles.
Enhanced Cloud Logging Filters:
- Complex Log Queries: Since you're using a logging sink, you might be able to construct more complex queries that take into account a combination of fields from the log data itself. For example, combining userAgent with other log parameters that might indirectly reveal UI interactions, such as specific API calls that are common to UI use cases.
Using Audit Logs:
- Behavioral Patterns: Audit logs can sometimes provide deeper insights into how services are being used. By analyzing patterns in audit logs, you may be able to devise a heuristic or algorithm that identifies UI-based interactions based on behavior rather than direct metadata.

Each of these strategies has its own set of challenges and would require some experimentation to refine. However, they could potentially provide you with additional filters or mechanisms to accurately segment the jobs as you intend.

If these alternatives still don’t meet your needs, maintaining and enhancing the Cloud Run process to handle complex conditions and transformations, as you mentioned, is likely your best path forward.