# Fun with Data: Calculating Gender Balance Using First Names

• 385 views

Userlevel 6
• Looker Staff
• 213 replies

In the The Name Game: Step 3 we were able to compute, given a name, the percent likelihood that the name is male. The calculation is computed by dividing the size of the male population for a given name by the total population for that name.

### Some Gender Neutral Names

Explore From Here

### Building a Name Map

We can use the same table in BigQuery and this simple query to build a lookup table that maps names to their likelihood of being male. If we don’t have data (later when we join and the join fails), we assume a 50% probability.

Old LookML
``````- view: gender_guess
derived_table:
sql: |
SELECT
UPPER(name) AS name
, FLOAT(SUM(CASE WHEN gender = 'M' THEN number ELSE 0 END))
/ SUM(number) AS percentage_male
FROM [fh-bigquery:popular_names.usa_1910_2013]
GROUP EACH BY 1
fields:
- dimension: name
- dimension: percentage_male
type: number
sql: COALESCE(\${TABLE}.percentage_male, 0.5)
``````

New LookML
``````view: gender_guess {
derived_table: {
sql: SELECT
UPPER(name) AS name
, FLOAT(SUM(CASE WHEN gender = 'M' THEN number ELSE 0 END))
/ SUM(number) AS percentage_male
FROM [fh-bigquery:popular_names.usa_1910_2013]
GROUP EACH BY 1
;;
}

dimension: name {}

dimension: percentage_male {
type: number
sql: COALESCE(\${TABLE}.percentage_male, 0.5) ;;
}
}
``````

### Names and likelihood they are Male

The names with Percentage Male = 1 are certainly male, the names with 0 are certainly female. The fractional names are somewhere between.

Explore From Here

### Names and the United States Patent and Trademark Office (USPTO)

I recently uploaded all the USPTO data to BigQuery. The main table in this dataset is ‘case_files’ and on each case file, there is the name of the attorney assigned to the case.

### Attorney Names

Attorneys and the number of cases they’ve worked on.

Explore From Here

### Parsing out the First Name

It appears that names are of the form

<LAST_NAME>, <FIRST_NAME> <INITIAL OR NAME>

We can parse the pretty easily using a regular expression. First names appear to immediately follow the comma. We can codify this with a new dimension and a regular expression.

Old LookML
``````  - dimension: exm_attourney_first_name
sql: REGEXP_EXTRACT(\${exm_attorney_name}, `, (\\w+)`)
``````

New LookML
``````dimension: exm_attourney_first_name {
sql: REGEXP_EXTRACT(\${exm_attorney_name}, `, (\\w+)`) ;;
}
``````

And the results:

Explore From Here

### Joining the Tables

Big query doesn’t let us join on expressions, so we have to move the dimension in a derived table. BigQuery is smart enough to optimize this out if we don’t use the expression when referencing this derived table in a query.

Old LookML
``````- view: case_file
derived_table:
sql: |
SELECT *,
REGEXP_EXTRACT(exm_attorney_name, ', (\\w+)') as exm_attorney_first_name
``````

New LookML
``````view: case_file {
derived_table: {
sql: SELECT *,
REGEXP_EXTRACT(exm_attorney_name, ', (\\w+)') as exm_attorney_first_name
;;
}
}
``````

Next we join in gender_guess to the case file.

Old LookML
``````- explore: case_file
joins:
- join: exm_attorney_gender
from: gender_guess
sql_on: \${case_file.exm_attorney_first_name} = \${exm_attorney_gender.name}
relationship: many_to_one
``````

New LookML
``````explore: case_file {
join: exm_attorney_gender {
from: gender_guess
sql_on: \${case_file.exm_attorney_first_name} = \${exm_attorney_gender.name} ;;
relationship: many_to_one
}
}
``````

Now we see names together with gender score (percentage male).

Explore From Here

We’d like to be able to see the count of attorneys and the percentage of those attorneys that were male over time. Summing the probabilities distinctly on the attorney name will give us those counts.

Old LookML
``````- measure: count_male_cases
type: number
sql: SUM(\${exm_attorney_gender.percentage_male})

- measure: percentage_male_cases
type: number
sql: \${count_male_cases}/\${count}
value_format_name: percent_2

- measure: count_attornies
type: count_distinct
sql: \${exm_attorney_name}

- measure: count_male_attornies
type: sum_distinct
sql: \${exm_attorney_gender.percentage_male}
sql_distinct_key: \${exm_attorney_name}

- measure: percentage_male_attornies
type: number
sql: \${count_male_attornies}/\${count_attornies}
value_format_name: percent_2
``````

New LookML
``````measure: count_male_cases {
type: number
sql: SUM(\${exm_attorney_gender.percentage_male}) ;;
}

measure: percentage_male_cases {
type: number
sql: \${count_male_cases}/\${count} ;;
value_format_name: percent_2
}

measure: count_attornies {
type: count_distinct
sql: \${exm_attorney_name} ;;
}

measure: count_male_attornies {
type: sum_distinct
sql: \${exm_attorney_gender.percentage_male} ;;
sql_distinct_key: \${exm_attorney_name} ;;
}

measure: percentage_male_attornies {
type: number
sql: \${count_male_attornies}/\${count_attornies} ;;
value_format_name: percent_2
}
``````

### Gender Mix Over Time

I looks like in 1978 the USPTO Examiner staff was 2/3’s Male, and that those men handled close to 90% of the case load. 10 years later, in 1988, that balance had changed to 50%, with about 50% of the case load being male.

Since then, the examiner staff has become predominately female —now only 40% male, with only 40% of the case load being handled by male attorneys.

Explore From Here