Overview

Pyethnicity is a library that helps you proxy race/ethnicity based on name and location. It is free and open-source, released under the MIT license. Race prediction is important in many contexts. For example, the CFPB uses BISG to conduct its fair lending analysis. Better race prediction can have real-world impact on lending, healthcare, law, and more.

This software is intended to serve as a tool for positive and constructive purposes. By using this software, you agree to employ it in an ethical manner. Under no circumstances should pyethnicity be used to engage in or promote discrimination based on race, ethnicity, or any other characteristic, in any shape or form.

The data used in this library can be found in its data folder. Geographic data (such as the percentage of Asian people in a ZCTA) comes from the 2010 United States Census Summary File 1. Surname data is a combination of the 2010 United States Census Frequently Occurring Surnames and proprietary 2022 voter registration data from L2, Inc. First name data is a combination of HMDA data sourced from “Demographic Aspects of First Names”[1] and the aforementioned voter registration data.

Pyethnicity provides routines for BISG, BIFSG, and a BiLSTM model. More details about data, model development, and performance can be found in the corresponding paper, “Can We Trust Race Prediction?”[2].

Installation

The easiest way to install pyethnicity is through pip. Simply run:

pip install pyethnicity

Note that pyethnicity depends on several packages:

  1. onnxruntime-gpu: For fast and flexible inference

  2. pandas: For the final output DataFrames

  3. polars: For fast and memory efficient cleaning routines

  4. pyarrow: For parquet files

  5. pycutils: A lightweight, stdlib-only collection of useful functions

  6. tqdm: A lightweight progress bar

Bayesian Models

bifsg(first_name, last_name, geography, geo_type)

Implements Bayesian Improved Firstname Surname Geocoding (BIFSG), developed by Voicu (2018) https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012. Pyethnicity augments the Census surname list and HMDA first name list with distributions calculated from voter registration data sourced from L2. BIFSG is implemented as follows:

\[P(r|f,s,g) = \frac{P(r|s) \times P(f|r) \times P(g|r)}{\sum_{r=1}^4 P(r|s) \times P(f|r) \times P(g|r)}\]

where r is race, f is first name, s is surname, and g is geography. The sum is across all races, i.e. Asian, Black, Hispanic, and White.
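To make the formula concrete, here is a minimal sketch of the BIFSG calculation for a single person. The function name `bifsg_posterior` and all probabilities are hypothetical, chosen only for illustration; they are not taken from pyethnicity or its data files.

```python
RACES = ("asian", "black", "hispanic", "white")


def bifsg_posterior(p_race_given_surname, p_first_given_race, p_geo_given_race):
    """Multiply the three inputs per race, then normalize so the result sums to 1."""
    numerators = {
        race: p_race_given_surname[race]
        * p_first_given_race[race]
        * p_geo_given_race[race]
        for race in RACES
    }
    total = sum(numerators.values())  # the denominator: sum over all four races
    return {race: value / total for race, value in numerators.items()}


# Made-up inputs (assumptions for illustration, not real distributions).
p_r_s = {"asian": 0.90, "black": 0.01, "hispanic": 0.02, "white": 0.07}  # P(r|s)
p_f_r = {"asian": 0.002, "black": 0.0001, "hispanic": 0.0001, "white": 0.0002}  # P(f|r)
p_g_r = {"asian": 0.0003, "black": 0.0005, "hispanic": 0.0002, "white": 0.0004}  # P(g|r)

posterior = bifsg_posterior(p_r_s, p_f_r, p_g_r)
most_likely = max(posterior, key=posterior.get)
```

Because each numerator is divided by the same total, the four posterior probabilities always sum to 1, whatever the scale of the inputs.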

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of zcta or tract

Returns:

A DataFrame of last_name, geography, and P(r|f,s,g) for Asian, Black, Hispanic, and White. If any of the first name, last name, or geography cannot be found, the probability is NaN.

Return type:

pd.DataFrame

Notes

The data files can be found in:
  • data/distributions/prob_first_name_given_race.parquet

  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.bifsg(
...     first_name="cangyuan", last_name="li", geography=27106, geo_type="zcta"
... )
>>> pyethnicity.bifsg(
...     first_name=["cangyuan", "mark"],
...     last_name=["li", "luo"],
...     geography=[27106, 11106],
...     geo_type="zcta",
... )

bisg(last_name, geography, geo_type)

Implements Bayesian Improved Surname Geocoding (BISG), developed by Elliott et al. (2009) https://link.springer.com/article/10.1007/s10742-009-0047-1. Pyethnicity augments the Census surname list with distributions calculated from voter registration data sourced from L2. BISG is implemented as follows:

\[P(r|s,g) = \frac{P(r|s) \times P(g|r)}{\sum_{r=1}^4 P(r|s) \times P(g|r)}\]

where r is race, s is surname, and g is geography. The sum is across all races, i.e. Asian, Black, Hispanic, and White.

Parameters:
  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of zcta or tract

Returns:

A DataFrame of last_name, geography, and P(r | s, g) for Asian, Black, Hispanic, and White. If either the last name or geography cannot be found, the probability is NaN.

Return type:

pd.DataFrame
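The NaN behavior described under Returns can be sketched as a lookup followed by the BISG update. Everything here is an assumption for illustration: the lookup tables, their values, the helper name `bisg_row`, and the input normalization (lowercasing names, zero-padding ZCTAs) are hypothetical and may differ from what pyethnicity does internally.

```python
import math

RACES = ("asian", "black", "hispanic", "white")

# Hypothetical lookup tables with made-up values, ordered as in RACES.
P_RACE_GIVEN_SURNAME = {"li": (0.90, 0.01, 0.02, 0.07)}
P_GEO_GIVEN_RACE = {"11106": (0.0004, 0.0001, 0.0003, 0.0002)}


def bisg_row(last_name, geography):
    """Return P(r|s,g) per race, or NaN for every race if a lookup misses."""
    # Assumed normalization: case-insensitive names, 5-character ZCTA strings.
    surname_probs = P_RACE_GIVEN_SURNAME.get(last_name.strip().lower())
    geo_probs = P_GEO_GIVEN_RACE.get(str(geography).zfill(5))
    if surname_probs is None or geo_probs is None:
        return dict.fromkeys(RACES, math.nan)
    numerators = [s * g for s, g in zip(surname_probs, geo_probs)]
    total = sum(numerators)
    return {race: n / total for race, n in zip(RACES, numerators)}


found = bisg_row("Li", 11106)  # both lookups succeed
missing = bisg_row("notasurname", 11106)  # surname lookup misses
```

Returning NaN rather than falling back to a default keeps unmatched rows visible, so they can be filtered or handled by another method downstream.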

Notes

The data files can be found in:
  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.bisg(last_name="li", geography=27106, geo_type="zcta")
>>> pyethnicity.bisg(last_name=["li", "luo"], geography=[27106, 11106], geo_type="zcta")

Machine Learning Models

predict_race(first_name, last_name, geography, geo_type, chunksize=1028)

Predict race from first name, last name, and geography. The output from pyethnicity.predict_race_flg is ensembled with the outputs of pyethnicity.bisg and pyethnicity.bifsg.

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of zcta or tract

  • chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028

Returns:

A DataFrame of first_name, last_name, geography, and P(r|n,g) for Asian, Black, Hispanic, and White. If the geography cannot be found, the probability is NaN.

Return type:

pd.DataFrame

Notes

The data files can be found in:
  • data/models/first_last.onnx

  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

  • data/distributions/prob_first_name_given_race.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.predict_race(
...     first_name="cangyuan", last_name="li", geography=11106, geo_type="zcta"
... )
>>> pyethnicity.predict_race(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"],
...     geography=[11106, 27106], geo_type="zcta"
... )

predict_race_fl(first_name, last_name, chunksize=1028)

Predict race from first and last name.

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028

Returns:

A DataFrame of first_name, last_name, and P(r|f,s) for Asian, Black, Hispanic, and White.

Return type:

pd.DataFrame
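As a rough illustration of what the chunksize parameter controls, the sketch below splits a list of inputs into batches of at most chunksize rows. The helper `iter_chunks` is hypothetical and is not part of pyethnicity; the actual batching code may differ.

```python
def iter_chunks(rows, chunksize=1028):
    """Yield consecutive slices of `rows`, each with at most `chunksize` items."""
    for start in range(0, len(rows), chunksize):
        yield rows[start : start + chunksize]


# 2,500 placeholder inputs are split into two full chunks and one partial chunk.
names = [f"name{i}" for i in range(2500)]
chunk_lengths = [len(chunk) for chunk in iter_chunks(names)]
```

A smaller chunksize lowers peak memory use per inference call at the cost of more calls; a larger one does the reverse.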

Notes

The data files can be found in:
  • data/models/first_last.onnx

Examples

>>> import pyethnicity
>>> pyethnicity.predict_race_fl(first_name="cangyuan", last_name="li")
>>> pyethnicity.predict_race_fl(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"]
... )

predict_race_flg(first_name, last_name, geography, geo_type, chunksize=1028)

Predict race from first name, last name, and geography. The output from pyethnicity.predict_race_fl is combined with geography using Naive Bayes:

\[P(r|n,g) = \frac{P(r|n) \times P(g|r)}{\sum_{r=1}^4 P(r|n) \times P(g|r)}\]

where r is race, n is name, and g is geography. The sum is across all races, i.e. Asian, Black, Hispanic, and White.
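The update above can be sketched in a few lines to show how geography can change the most likely race. The helper `combine_with_geography` and all numbers are hypothetical; in particular, `name_probs` stands in for a model output and is not a real prediction.

```python
RACES = ("asian", "black", "hispanic", "white")


def combine_with_geography(p_race_given_name, p_geo_given_race):
    """Apply P(r|n,g) = P(r|n) * P(g|r) / sum_r P(r|n) * P(g|r)."""
    numerators = [n * g for n, g in zip(p_race_given_name, p_geo_given_race)]
    total = sum(numerators)
    return [value / total for value in numerators]


# Made-up inputs, ordered as in RACES (assumptions for illustration only).
name_probs = [0.40, 0.05, 0.05, 0.50]  # P(r|n): name alone slightly favors White
geo_probs = [0.0009, 0.0001, 0.0002, 0.0003]  # P(g|r): area is more common for Asian

posterior = combine_with_geography(name_probs, geo_probs)
before = RACES[name_probs.index(max(name_probs))]
after = RACES[posterior.index(max(posterior))]
```

With these inputs the name-only prediction is White, while the combined posterior favors Asian, which is the kind of shift the geography term is meant to capture.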

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of zcta or tract

  • chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028

Returns:

A DataFrame of first_name, last_name, geography, and P(r|n,g) for Asian, Black, Hispanic, and White.

Return type:

pd.DataFrame

Notes

The data files can be found in:
  • data/models/first_last.onnx

  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.predict_race_flg(
...     first_name="cangyuan", last_name="li", geography=11106, geo_type="zcta"
... )
>>> pyethnicity.predict_race_flg(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"],
...     geography=[11106, 27106], geo_type="zcta"
... )

Contributors
