Dataform - Documentation Generation

Hi, 

Does any know how we can generate documentation i.e. our model's documentation? 

Like in DBT, we can generate docs there.

Thanks

Solved Solved
1 12 1,032
6 ACCEPTED SOLUTIONS

Hi @davidregalado25 ,

Here are a few things to check:

  1. Check Dataform Compilation:

    • Compile After Changes: Did you compile your Dataform project after updating the descriptions in test_david_ILS_categories.sqlx? Changes won't show up until you do.
    • Review Compilation Logs: Look for any errors in the logs that might be related to descriptions.
  2. Permissions:

    • Verify Permissions: Make sure the Dataform service account has permissions to edit table metadata in BigQuery (roles like "BigQuery Data Editor" or "BigQuery Job User").
  3. BigQuery Metadata:

    • Allow Time for Updates: BigQuery can take a little while to reflect changes. Give it some time, then check the descriptions again in the BigQuery UI or query INFORMATION_SCHEMA.COLUMNS.
  4. Syntax or Formatting Issues:

    • Check Indentation and Typos: Double-check the indentation in your config block (in test_david_ILS_categories.sqlx) and ensure there are no typos in your column names.

View solution in original post


  1. Check Dataform Compilation:

    • Compile After Changes: Did you compile your Dataform project after updating the descriptions in test_david_ILS_categories.sqlx? Changes won't show up until you do.
    • Review Compilation Logs: Look for any errors in the logs that might be related to descriptions.

The UI says COMPILED. I'm using the UI, if there are more functionality in the CLI, I think it should be included in the UI too.


Permissions:
    • Verify Permissions: Make sure the Dataform service account has permissions to edit table metadata in BigQuery (roles like "BigQuery Data Editor" or "BigQuery Job User")
  1.  

It can create the tables. Only the descriptions are missing.

 BigQuery Metadata:
  • Allow Time for Updates: BigQuery can take a little while to reflect changes. Give it some time, then check the descriptions again in the BigQuery UI or query INFORMATION_SCHEMA.COLUMNS.
There's anything about descriptions when running

select
* from ILS.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'test_david_ILS_categories'
 
  1. Syntax or Formatting Issues:

    • Check Indentation and Typos: Double-check the indentation in your config block (in test_david_ILS_categories.sqlx) and ensure there are no typos in your column names.

Is indentation important? I clicked on the FORMAT button. No luck.

--
Best regards
David Regalado
Web | Linkedin | Cloudskillsboost

View solution in original post

Hi @davidregalado25 ,

You mentioned that the Dataform service account can create tables, which is a positive sign. However, there might still be a permissions issue specifically related to modifying table metadata. Please double-check that the service account has the "BigQuery Data Editor" role at the dataset or project level without any restrictions that might prevent metadata changes.

BigQuery sometimes delays reflecting metadata changes in both its UI and the INFORMATION_SCHEMA queries. Although it's frustrating, giving it a bit more time might resolve the issue. If, after waiting, the descriptions still don't appear, the problem likely lies elsewhere.

You've done the right thing by using the FORMAT button in Dataform's UI. However, manual changes after formatting can introduce errors. Indentation is crucial in YAML-like configurations, as it helps define the structure. Here's the correct structure for your SQLX file's config block:

 
config {
  type: "table",
  database: "your_project_id",
  schema: "ILS",
  name: "test_david_ILS_categories",
  description: "A table description here",
  columns: {
    category_id: {
      description: "Description of category_id",
      type: "STRING"
    },
    category_name: {
      description: "Description of category_name",
      type: "STRING"
    }
  }
}

View solution in original post

I got this error:

Unexpected property "category_id" in columns column descriptor. Supported properties are: ["description","columns","displayName","dimension","aggregator","expression","tags","bigqueryPolicyTags"]

Screenshot 2024-04-09 at 23.43.43.png

--
Best regards
David Regalado
Web | Linkedin | Cloudskillsboost

View solution in original post

Hi @davidregalado25 ,

The error message you received suggests that the way you're attempting to define the column descriptions and types is not recognized by Dataform. Specifically, the properties you should be focusing on within the columns section are those listed in the error message, such as description, displayName, and tags.

You should adjust the configuration to correctly define the column descriptions. Here's how you can structure your table and column descriptions in the SQLX file, focusing on incorporating the supported properties correctly:

 
config {
    type: "table",
    database: "your_project_id",
    schema: "ILS",
    name: "test_david_ILS_categories",
    description: "A table description here",
    columns: {
        category_id: {
            description: "Description of category_id",
            tags: ["example_tag"] // Note: Remove the type property as it's not supported here
        },
        category_name: {
            description: "Description of category_name",
            tags: ["example_tag"] // Note: Similarly, remove the type property
        }
    }
}

In this corrected format, I've removed the unsupported type property from the column descriptors and ensured that only supported properties are included. If you were attempting to specify data types for these columns directly in the SQLX file, it's important to note that Dataform uses this configuration primarily for documentation and metadata purposes, rather than for defining the schema of the table directly in BigQuery.

Additional Steps

  1. Apply the Corrected Configuration: Adjust your SQLX file according to the corrected format and redeploy your Dataform project.
  2. Verify Changes in BigQuery: After redeployment, check again in BigQuery to see if the column descriptions are now appearing as expected.
  3. Review Dataform Documentation: For further clarification on defining column properties and any additional configuration options, reviewing the Dataform documentation can be very helpful.

View solution in original post

I've created a new thread here.

--
Best regards
David Regalado
Web | Linkedin | Cloudskillsboost

View solution in original post

12 REPLIES 12

In Google Cloud Dataform, you can generate documentation for your models, similar to what you can do in DBT. Here's how you can do it:

Overview of Dataform Core

Dataform core is an open-source meta-language used to create SQL tables and workflows. It extends SQL by providing features like dependency management, automated data quality testing, and data documentation.

Documenting Tables and Columns

You can document your tables and their fields directly within the config block of a SQLX file. This documentation is then pushed directly to BigQuery, and you can parse this documentation to push it out to other tools.

Here's an example of how you can define the documentation in a config block of a SQLX file:

config {

  type: "table",

  description: "This table joins orders information from OnlineStore & payment information from PaymentApp",

  columns: {

    order_date: "The date when a customer placed their order",

    id: "Order ID as defined by OnlineStore",

    order_status: "The status of an order e.g. sent, delivered",

    customer_id: "Unique customer ID",

    payment_status: "The status of a payment e.g. pending, paid",

    payment_method: "How the customer chose to pay",

    item_count: "The number of items the customer ordered",

    amount: "The amount the customer paid"

  },

  assertions: {

    uniqueKey: ["id"]

  }

}

Insights

  • Flexibility in Documentation: Dataform allows you to document both tables and individual columns, providing a comprehensive view of your data model.
  • Integration with BigQuery: The documentation created in Dataform is directly pushed to BigQuery, allowing for seamless integration with other tools.
  • Enhanced Data Quality Testing: Along with documentation, Dataform core also offers automated data quality testing, ensuring that your data is consistent and reliable.

 

For more information refer to Dataform core on the official Google Cloud documentation page.

The descriptions aren't  working for me:

Screenshot 2024-04-09 at 12.24.36.png

Screenshot 2024-04-09 at 12.23.26.png

test_david_ILS_categories.sqlx file:

config {
type: "table",
description: "Category table",
columns: {
category_id: "category_id",
category_name: "category_name"
}
}

LOAD DATA OVERWRITEmy_project.my_dataset.test_david_ILS_categories
(
category_id String,
category_name String
)
FROM FILES (
format = 'CSV',
uris = ['gs://my_bucket/categories_2024_02*.csv']);
SELECT * FROM my_project.my_dataset.test_david_ILS_categories

 

--
Best regards
David Regalado
Web | Linkedin | Cloudskillsboost

Hi @davidregalado25 ,

Here are a few things to check:

  1. Check Dataform Compilation:

    • Compile After Changes: Did you compile your Dataform project after updating the descriptions in test_david_ILS_categories.sqlx? Changes won't show up until you do.
    • Review Compilation Logs: Look for any errors in the logs that might be related to descriptions.
  2. Permissions:

    • Verify Permissions: Make sure the Dataform service account has permissions to edit table metadata in BigQuery (roles like "BigQuery Data Editor" or "BigQuery Job User").
  3. BigQuery Metadata:

    • Allow Time for Updates: BigQuery can take a little while to reflect changes. Give it some time, then check the descriptions again in the BigQuery UI or query INFORMATION_SCHEMA.COLUMNS.
  4. Syntax or Formatting Issues:

    • Check Indentation and Typos: Double-check the indentation in your config block (in test_david_ILS_categories.sqlx) and ensure there are no typos in your column names.


  1. Check Dataform Compilation:

    • Compile After Changes: Did you compile your Dataform project after updating the descriptions in test_david_ILS_categories.sqlx? Changes won't show up until you do.
    • Review Compilation Logs: Look for any errors in the logs that might be related to descriptions.

The UI says COMPILED. I'm using the UI, if there are more functionality in the CLI, I think it should be included in the UI too.


Permissions:
    • Verify Permissions: Make sure the Dataform service account has permissions to edit table metadata in BigQuery (roles like "BigQuery Data Editor" or "BigQuery Job User")
  1.  

It can create the tables. Only the descriptions are missing.

 BigQuery Metadata:
  • Allow Time for Updates: BigQuery can take a little while to reflect changes. Give it some time, then check the descriptions again in the BigQuery UI or query INFORMATION_SCHEMA.COLUMNS.
There's anything about descriptions when running

select
* from ILS.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'test_david_ILS_categories'
 
  1. Syntax or Formatting Issues:

    • Check Indentation and Typos: Double-check the indentation in your config block (in test_david_ILS_categories.sqlx) and ensure there are no typos in your column names.

Is indentation important? I clicked on the FORMAT button. No luck.

--
Best regards
David Regalado
Web | Linkedin | Cloudskillsboost

Hi @davidregalado25 ,

You mentioned that the Dataform service account can create tables, which is a positive sign. However, there might still be a permissions issue specifically related to modifying table metadata. Please double-check that the service account has the "BigQuery Data Editor" role at the dataset or project level without any restrictions that might prevent metadata changes.

BigQuery sometimes delays reflecting metadata changes in both its UI and the INFORMATION_SCHEMA queries. Although it's frustrating, giving it a bit more time might resolve the issue. If, after waiting, the descriptions still don't appear, the problem likely lies elsewhere.

You've done the right thing by using the FORMAT button in Dataform's UI. However, manual changes after formatting can introduce errors. Indentation is crucial in YAML-like configurations, as it helps define the structure. Here's the correct structure for your SQLX file's config block:

 
config {
  type: "table",
  database: "your_project_id",
  schema: "ILS",
  name: "test_david_ILS_categories",
  description: "A table description here",
  columns: {
    category_id: {
      description: "Description of category_id",
      type: "STRING"
    },
    category_name: {
      description: "Description of category_name",
      type: "STRING"
    }
  }
}

I got this error:

Unexpected property "category_id" in columns column descriptor. Supported properties are: ["description","columns","displayName","dimension","aggregator","expression","tags","bigqueryPolicyTags"]

Screenshot 2024-04-09 at 23.43.43.png

--
Best regards
David Regalado
Web | Linkedin | Cloudskillsboost

Hi @davidregalado25 ,

The error message you received suggests that the way you're attempting to define the column descriptions and types is not recognized by Dataform. Specifically, the properties you should be focusing on within the columns section are those listed in the error message, such as description, displayName, and tags.

You should adjust the configuration to correctly define the column descriptions. Here's how you can structure your table and column descriptions in the SQLX file, focusing on incorporating the supported properties correctly:

 
config {
    type: "table",
    database: "your_project_id",
    schema: "ILS",
    name: "test_david_ILS_categories",
    description: "A table description here",
    columns: {
        category_id: {
            description: "Description of category_id",
            tags: ["example_tag"] // Note: Remove the type property as it's not supported here
        },
        category_name: {
            description: "Description of category_name",
            tags: ["example_tag"] // Note: Similarly, remove the type property
        }
    }
}

In this corrected format, I've removed the unsupported type property from the column descriptors and ensured that only supported properties are included. If you were attempting to specify data types for these columns directly in the SQLX file, it's important to note that Dataform uses this configuration primarily for documentation and metadata purposes, rather than for defining the schema of the table directly in BigQuery.

Additional Steps

  1. Apply the Corrected Configuration: Adjust your SQLX file according to the corrected format and redeploy your Dataform project.
  2. Verify Changes in BigQuery: After redeployment, check again in BigQuery to see if the column descriptions are now appearing as expected.
  3. Review Dataform Documentation: For further clarification on defining column properties and any additional configuration options, reviewing the Dataform documentation can be very helpful.

I've created a new thread here.

--
Best regards
David Regalado
Web | Linkedin | Cloudskillsboost

Is it possible to document tables and column descriptions from a source file instead of a config block? DBT enables the use of a YAML file to document tables and columns, and I think it becomes more organized and clearer.

Is it possible do the same in Dataform?

Dataform does not natively support the use of external source files like YAML for documenting tables and columns in the same way that dbt does. In Dataform, documentation is typically done directly within the SQLX files.

Inline Documentation in SQLX Files:

Dataform prioritizes documentation directly within the config block of each SQL model in the SQLX files. This allows for adding descriptions for the entire table and individual columns, keeping the documentation close to the code and ensuring its relevance. This method is the primary and recommended approach for documentation in Dataform.

Lack of Native YAML Support:

Unlike dbt, which extensively uses YAML files for centralized documentation and configuration, Dataform does not offer native support for external documentation files like YAML. This means that users cannot readily leverage YAML for organized documentation in Dataform without resorting to alternative methods.

Alternative Methods and Limited Tool Availability:

While the idea of using tools and plugins for YAML-based documentation in Dataform is appealing, as of the latest information, such tools are not widely recognized or standard within the Dataform ecosystem.

  • Spectacles: This tool, primarily designed for dbt, lacks documented integration with Dataform and might not be directly applicable.
  • dataform-yaml-config-plugin: This community-developed plugin is relatively new and requires further verification of its functionality and compatibility before being considered a reliable option.

Therefore, users seeking YAML-like functionality might need to consider custom solutions or adapt their workflow to align with Dataform's capabilities. This could involve utilizing JavaScript files, external libraries, or the Dataform API, recognizing potential limitations and integration challenges.

Project and Team Considerations:

The choice between inline documentation in SQLX files and exploring alternative methods for YAML-like functionality depends on project size, team preferences, and desired document manageability. Inline documentation offers close integration with code, while external files provide a clearer overall picture and improved organization, particularly for large projects.

Dataform currently prioritizes inline documentation within SQLX files and lacks native support for YAML-based documentation like dbt. While external tools and plugins offering YAML-like functionality are emerging, they are not yet widely adopted or standard within the Dataform ecosystem. Teams should carefully assess their needs and consider exploring custom solutions if organized documentation outside SQLX files is desired.

In dbt we can do ```dbt docs generate``` that creates documentation and provides a webserver to access it. Is there a `dbt docs serve` equivalent for dataform where from a webpage I can see all the table and field definitions similar to what dbt provides?

Unfortunately, Dataform doesn't have a direct equivalent to the dbt docs serve command that automatically spins up a web server for you to view the generated data documentation. However, understanding the differences between Dataform and dbt, as well as exploring alternative ways to achieve a similar result, can be helpful:

Differences between Dataform and dbt

  • Core Function: Dataform is primarily focused on SQL-based data transformations within BigQuery, streamlining data pipeline development, version control, dependency management, and orchestration within the Google Cloud ecosystem. dbt offers a broader set of features for ELT pipelines that extend beyond data transformation to include snapshotting source data and generating rich documentation as a first-class output.

  • Documentation as an Artifact: dbt treats documentation as an integral part of the development process, generating rich HTML web pages from model descriptions. Dataform allows for inline documentation within SQLX files but lacks a built-in feature to generate a standalone, servable web documentation site like dbt.

Alternatives to Achieve Similar Functionality in Dataform

  • Metadata Exploration in BigQuery: Utilize BigQuery's capabilities for exploring and documenting data:

    • BigQuery Console: Directly explore table schemas and descriptions within the web interface.
    • BigQuery Information Schema: Query metadata about your datasets, including table and column descriptions, to build custom reports or dashboards.
  • Data Catalog Integration: Google Cloud's Data Catalog can serve as a powerful tool for documenting and managing metadata. It automatically captures metadata from BigQuery and can be used to document datasets, tables, and fields, offering a different but valuable form of documentation accessible through a web interface.

  • Custom Solution: While requiring more development effort, building a custom solution provides the greatest flexibility in how the documentation is generated and presented:

    • Extract Metadata: Use SQL queries or scripts to gather metadata from your Dataform project or directly from BigQuery.
    • Templating Engine: Employ a templating engine like Jinja2 or Handlebars to generate HTML pages from the extracted metadata.
    • Lightweight Web Server: Use a web server framework (e.g., Flask for Python) to serve the generated HTML documentation.

Summary

While Dataform does not offer a direct dbt docs serve equivalent, leveraging BigQuery's native features, integrating with Google Cloud's Data Catalog, or developing a custom documentation serving solution are viable alternatives. Each method has its own set of trade-offs in terms of ease of implementation, maintenance, and how closely it can mimic dbt's documentation experience.