BigQuery & GA4 - session source / medium

Hi, I am working on a dashboard that is based on BigQuery
there are several conversions I am sending to BigQuery and I am trying to get the session scoped source and medium like analytics provided in the UI, unfortunately when observing the records of my conversion events I see no presence for the session scoped utm's so I am left with null records
is there a solution to this scenraio or a workaround, currently I am measuring things based on first user scoped utm's because I have no other option and it raises a lot of questions when comparing to GA4 UI

thanks in advance 🙂

Solved Solved
0 13 8,860
1 ACCEPTED SOLUTION

Dealing with Discrepancies in Direct/None Traffic and Ecommerce Metrics in GA4 BigQuery Queries

Understanding the Challenges:

  • Discrepancies in fetching direct/none traffic data.
  • Issues with additional metrics like ecommerce.purchase_revenue.

Strategies to Address These Issues:

1. Understanding Direct/None Traffic

  • Identification: Ensure your queries correctly identify direct/none traffic, which typically occurs when there's no referrer information or the source/medium cannot be determined.
  • Session Start Logic: Review how sessions are identified, as direct/none traffic should be associated with the first event of a new session or a session_start event.

2. Refining Query for Direct/None Traffic

  • Query Adjustment: Modify your SQL query to accurately capture direct/none traffic by checking for null or missing values in the source/medium fields and labeling them accordingly.
  • Example Query Snippet:
 
CASE WHEN traffic_source.source IS NULL OR traffic_source.source = '' THEN 'direct' ELSE traffic_source.source END AS source,
CASE WHEN traffic_source.medium IS NULL OR traffic_source.medium = '' THEN '(none)' ELSE traffic_source.medium END AS medium

3. Handling ecommerce.purchase_revenue

  • Data Consistency: Ensure consistent population of ecommerce.purchase_revenue to avoid mixed results.
  • Join Logic: Verify correct join conditions when working with multiple tables or datasets to align revenue data with corresponding sessions and traffic sources.

4. Query Optimization for Additional Metrics

  • Metric Alignment: Align queries with GA4's calculation of ecommerce.purchase_revenue, potentially summing up revenue per session or user.
  • Aggregation and Grouping: Properly aggregate and group data to reflect correct revenue figures at the appropriate level (session or user).

5. Testing and Validation

  • Small Data Sets: Test queries on smaller subsets to ensure expected results.
  • Compare with GA4 Interface: Regularly compare query results with standard reports in the GA4 interface for consistency.

6. Advanced Debugging

  • Event-Level Analysis: Dive into event-level data to understand how different events contribute to metrics.
  • Data Layer Inspection: Review data layer implementation on your website to ensure accurate capture of events and traffic source data.

View solution in original post

13 REPLIES 13

There isn't a direct method to extract session-scoped source and medium from BigQuery conversion events. However, you can employ the following workaround:

Step 1: Construct a new table in BigQuery that captures the session-scoped source and medium for each session:

 

SELECT session_id,
       FIRST_VALUE(traffic_source) OVER (PARTITION BY session_id ORDER BY event_timestamp ASC) AS session_scoped_source,
       FIRST_VALUE(traffic_medium) OVER (PARTITION BY session_id ORDER BY event_timestamp ASC) AS session_scoped_medium
FROM `[YOUR_PROJECT_ID].[YOUR_DATASET_ID].[YOUR_EVENT_TABLE_ID]`

Step 2: Merge the conversion event table with the newly created session-scoped source and medium table using the session_id column.

Step 3: Incorporate the session_scoped_source and session_scoped_medium columns in your dashboard to evaluate the efficacy of your conversions based on session-scoped source and medium.

For instance:

 

SELECT event_name,
       COUNT(*) AS conversion_count,
       session_scoped_source,
       session_scoped_medium
FROM `[YOUR_PROJECT_ID].[YOUR_DATASET_ID].[YOUR_EVENT_TABLE_ID]`
JOIN `[YOUR_PROJECT_ID].[YOUR_DATASET_ID].[YOUR_SESSION_SCOPED_SOURCE_MEDIUM_TABLE_ID]`
ON `[YOUR_PROJECT_ID].[YOUR_DATASET_ID].[YOUR_EVENT_TABLE_ID]`.session_id = `[YOUR_PROJECT_ID].[YOUR_DATASET_ID].[YOUR_SESSION_SCOPED_SOURCE_MEDIUM_TABLE_ID]`.session_id
GROUP BY event_name,
       session_scoped_source,
       session_scoped_medium
event_name conversion_count session_scoped_source session_scoped_medium
purchase 100 google organic search
purchase 50 facebook social media
signup 200 direct (direct) / (none)

Always test the queries on a subset of your data initially to ensure accuracy and optimal performance.

Hi, thank you very much, I will definitely try this workaround

I tried it and I did manage to generate a report but I still got a lot of difference when comparing my data set to the exploration report in GA4, any suggestions?

 

There are a few possible reasons why you might be seeing differences between your BigQuery report and the exploration report in GA4:

  • Sampling: BigQuery reports are based on unsampled data, while exploration reports may be sampled, especially for large datasets. This means that the two reports may produce slightly different results.
  • Data modeling: GA4 uses a number of machine learning models to impute missing data and predict user behavior. These models can sometimes produce different results than the raw data in BigQuery.
  • Data retention: GA4 data is retained for 24 months by default, while BigQuery data can be retained for longer periods of time. This means that the two reports may include different date ranges.
  • Data processing: BigQuery reports are processed differently than exploration reports. This can sometimes lead to slight differences in the results.

To troubleshoot the differences between your two reports, you can try the following:

  • Compare the date ranges of the two reports. Make sure that you are comparing the same data range in both reports.
  • Check the sampling settings for your exploration report. If sampling is enabled, try disabling it to see if the results match your BigQuery report more closely.
  • Filter the data in your BigQuery report to match the data in your exploration report. For example, you can filter the BigQuery report to only include data from the same date range as the exploration report.
  • Use the same metrics and dimensions in both reports. Make sure that you are comparing the same metrics and dimensions in both reports.

Here are some additional tips for comparing your BigQuery report to the exploration report in GA4:

  • Use the same reporting identity for both reports. GA4 has four reporting identity spaces: User ID, Device ID, Google Signals, and Combined Signals. Make sure that you are using the same reporting identity for both reports.
  • Use the same filters for both reports. If you are filtering the data in one report, make sure to apply the same filters to the other report.
  • Compare the results by event dimension. This will help you to identify any specific events that are causing the differences between the two reports.
  • Compare the results by date range. This will help you to identify any specific date ranges that are causing the differences between the two reports.

hey, 

I am facing a similar issue and I tried many different workarounds to fetch the required results, however it does not work properly. The main issues is that it does not gets direct/ none data correctly and additionally if you add an extra metric such as ecommerce.purchase_revenue the results also get all mixed up. Is there a way to fix these? 

Dealing with Discrepancies in Direct/None Traffic and Ecommerce Metrics in GA4 BigQuery Queries

Understanding the Challenges:

  • Discrepancies in fetching direct/none traffic data.
  • Issues with additional metrics like ecommerce.purchase_revenue.

Strategies to Address These Issues:

1. Understanding Direct/None Traffic

  • Identification: Ensure your queries correctly identify direct/none traffic, which typically occurs when there's no referrer information or the source/medium cannot be determined.
  • Session Start Logic: Review how sessions are identified, as direct/none traffic should be associated with the first event of a new session or a session_start event.

2. Refining Query for Direct/None Traffic

  • Query Adjustment: Modify your SQL query to accurately capture direct/none traffic by checking for null or missing values in the source/medium fields and labeling them accordingly.
  • Example Query Snippet:
 
CASE WHEN traffic_source.source IS NULL OR traffic_source.source = '' THEN 'direct' ELSE traffic_source.source END AS source,
CASE WHEN traffic_source.medium IS NULL OR traffic_source.medium = '' THEN '(none)' ELSE traffic_source.medium END AS medium

3. Handling ecommerce.purchase_revenue

  • Data Consistency: Ensure consistent population of ecommerce.purchase_revenue to avoid mixed results.
  • Join Logic: Verify correct join conditions when working with multiple tables or datasets to align revenue data with corresponding sessions and traffic sources.

4. Query Optimization for Additional Metrics

  • Metric Alignment: Align queries with GA4's calculation of ecommerce.purchase_revenue, potentially summing up revenue per session or user.
  • Aggregation and Grouping: Properly aggregate and group data to reflect correct revenue figures at the appropriate level (session or user).

5. Testing and Validation

  • Small Data Sets: Test queries on smaller subsets to ensure expected results.
  • Compare with GA4 Interface: Regularly compare query results with standard reports in the GA4 interface for consistency.

6. Advanced Debugging

  • Event-Level Analysis: Dive into event-level data to understand how different events contribute to metrics.
  • Data Layer Inspection: Review data layer implementation on your website to ensure accurate capture of events and traffic source data.

@ms4446 Hi, I have a question about "2. Refining Query for Direct/None Traffic".

In my understanding, "traffic_source" does not provide event-scoped traffics, but it provides user-scoped traffics, which provides where the user first came from on the website.

Instead, if I am right, we should use "collected_traffic_source", which is a newly added record on June 2023, or values in parameters like "campaign", "source" or "medium" in event_params record before June 2023 to fetch event-scoped traffics. This is supported in other articles like below.

https://mu-kong.medium.com/get-session-campagin-from-ga4-exported-bigquery-data-fcb568afc44

https://tanelytics.com/ga4-bigquery-session-traffic_source/

I am concerned if my understanding is wrong, so please let me know the correct way to get session-scoped traffics when you finds out.

Thanks.

Your understanding is correct, and your approach aligns with the evolving nature of GA4's data schema and how session-scoped traffic sources are captured in BigQuery. 

Initial Limitations:

  • Early GA4 BigQuery exports mainly provided user-scoped traffic source information (reflected in the traffic_source fields). This shows the first source a user arrived from.

Updates and Enhancements:

  • GA4 and its BigQuery export schema have been updated, including the introduction of the collected_traffic_source field. This provides a more accurate way to understand traffic sources for individual sessions or events.

Using collected_traffic_source

  • This field gives session-scoped traffic source information, letting you attribute sessions and their events to the most recent traffic source. This offers a more granular view of user behavior and traffic source effectiveness.

Event-Scoped Traffic Before the Update:

  • Analysts used event_params to extract event-scoped traffic source information. This involved parsing parameters like campaign, source, or medium. It required more work, but was the main method before collected_traffic_source.

Your Approach

  • For session-scoped traffic sources in data post-June 2023, use the collected_traffic_source field.
  • For data before the update or for additional event-specific insights, you may still need to parse event_params.

Example Queries:

Using collected_traffic_source:

 
SELECT 
    event_name, 
    event_timestamp, 
    collected_traffic_source.source, 
    collected_traffic_source.medium, 
    collected_traffic_source.campaign 
FROM `your_project.your_dataset.events_*` 
WHERE _TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD' 

Extracting event-scoped traffic from event_params:

 
SELECT 
    event_name, 
    event_timestamp, 
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source, 
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') AS medium, 
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign') AS campaign 
FROM `your_project.your_dataset.events_*` 
WHERE _TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD' 

Your approach of using collected_traffic_source for session-scoped traffic analysis and event_params (for data before the update) is accurate and recommended for detailed traffic source analysis in GA4 BigQuery exports.

@ms4446 

I am relieved to hear that my approach is accurate and is a recommended way. Thank you for your quick and detailed answer. 

Hi @ms4446 ,

I have tried to count the sessions using 

collected_traffic_source.manual_source,
collected_traffic_source.manual_medium,
COUNT(DISTINCT user_pseudo_id, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id'))) sessions_count

on my dataset for event period after June 2023 but the result somehow still have huge discrepancy with the sessions source/medium in GA4 Interface. I have tried to use traffic_source.source and traffic_source.medium but the result still far different especially for Google / Organic and Direct . What is the best practice to get the nearest result as in GA4?

Hello,

Is there a way to get the ecommerce.purchase_revenue and ecommerce.purchase_revenue_in_usd values for non-null source, mediums like google, facebook etc?

Tried using both collected_traffic_source and event_params methods above but all total_revenue and revenue_usd seems to be tagged to null source, medium.

Sample of simple sql below.

 

 

SELECT
    ((SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source')) AS source,
    ((SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium')) AS medium,
    SUM(ecommerce.purchase_revenue) AS revenue,
    SUM(ecommerce.purchase_revenue_in_usd) AS revenue_usd
  FROM
    `datasource`
  GROUP BY
    source,
    medium

 

Results look like example below

owsourcemediumrevenuerevenue_usd

1
null
null
1000
10
2
google
cpc
null
null
3
adroll
Email
null
null
4
Instagram
display
null
null
5
google
organic
null
null

Thank you in advance!

j

The issue you're encountering, where ecommerce.purchase_revenue and ecommerce.purchase_revenue_in_usd are primarily associated with null sources and mediums, highlights a challenge in accurately attributing revenue to specific traffic sources within your GA4 BigQuery data. Here's why this happens:

  • Session Attribution: GA4 attributes ecommerce events like purchases to the session in which they occur. Sometimes the source and medium are linked to the session itself, rather than the specific purchase event.
  • Event and Session Scope: Traffic source information might be recorded at the start of a session or with specific events, but purchase events might not directly contain it. This leads to null values in your queries.

Refined Approach for Attribution

To get better attribution, focus on session-level or user-level data when analyzing purchase events. Here's what to do:

  1. Linking Purchase Events to Sessions: Determine the session where a purchase event occurred. Then, you can attribute the purchase to that session's traffic source.
  2. Leverage collected_traffic_source (Post-June 2023 Data): This field gives more accurate session-level attribution for data collected after June 2023.
  3. Aggregate Revenue at the Session Level: Instead of direct attribution, group purchase revenue by session and match to the session's traffic source.

Example Query

Here's a query demonstrating how to attribute revenue to sessions based on traffic source (assuming you can link purchase events to sessions):

 
WITH SessionTraffic AS ( 
    SELECT 
        session_id, 
        MAX(IF(key = 'source', value.string_value, NULL)) AS source, 
        MAX(IF(key = 'medium', value.string_value, NULL)) AS medium 
    FROM `datasource`, 
    UNNEST(event_params) 
    WHERE event_name = 'session_start' 
    GROUP BY session_id ), 

PurchaseRevenue AS ( 
    SELECT 
        session_id, 
        SUM(ecommerce.purchase_revenue) AS revenue, 
        SUM(ecommerce.purchase_revenue_in_usd) AS revenue_usd 
    FROM `datasource` 
    WHERE event_name = 'purchase' 
    GROUP BY session_id ) 

SELECT 
    st.source, 
    st.medium, 
    SUM(pr.revenue) AS revenue, 
    SUM(pr.revenue_usd) AS revenue_usd 
FROM SessionTraffic st 
JOIN PurchaseRevenue pr 
    ON st.session_id = pr.session_id 
GROUP BY st.source, st.medium 
  • Data Completeness: Your GA4 implementation should capture traffic source information at session start and track purchase events correctly.
  • Session Linkage: Ensure you can accurately connect purchase events to their corresponding sessions.

@ms4446 

This is amazing. 

Tested it out and while a bulk are still attributed to null (around 60-70%), I can see other source/mediums showing non-zero/null values for revenue now.

Thank you so much!