Overview
Pyethnicity is a library for proxying race and ethnicity from name and location. It is free and open source, released under the MIT license. Race prediction matters in many contexts. For example, the CFPB uses BISG in its fair lending analysis. Better race prediction can have real-world impact on lending, healthcare, law, and more.
This software is intended to serve as a tool for positive and constructive purposes. By using this software, you agree to employ it in an ethical manner. Under no circumstances should pyethnicity be used to engage in or promote discrimination based on race, ethnicity, or any other characteristic, in any shape or form.
The data used in this library can be found in its data folder. Geographic data (such as the percentage of Asian people in a ZCTA) come from the 2010 United States Census Summary File 1. Surname data combine the 2010 United States Census Frequently Occurring Surnames with proprietary 2022 voter registration data from L2, Inc. First name data combine HMDA data sourced from “Demographic Aspects of First Names” [1] with the same voter registration data.
Pyethnicity provides routines for BISG, BIFSG, and a BiLSTM model. More details about data, model development, and performance can be found in the corresponding paper, “Can We Trust Race Prediction?” [2].
Installation
The easiest way to install pyethnicity is with pip:
pip install pyethnicity
Note that pyethnicity depends on several packages
Bayesian Models
- bifsg(first_name, last_name, geography, geo_type)
Implements Bayesian Improved First Name Surname Geocoding (BIFSG), developed by Voicu (2018) https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012. Pyethnicity augments the Census surname list and HMDA first name list with distributions calculated from voter registration data sourced from L2. BIFSG is implemented as follows:
\[P(r|f,s,g) = \frac{P(r|s) \times P(f|r) \times P(g|r)}{\sum_{r'} P(r'|s) \times P(f|r') \times P(g|r')}\]
where r is race, f is first name, s is surname, and g is geography. The sum in the denominator runs over all four races r': Asian, Black, Hispanic, and White.
- Parameters:
first_name (Name) – A string or array-like of strings
last_name (Name) – A string or array-like of strings
geography (Geography) – A scalar or array-like of geographies
geo_type (GeoType) – One of zcta or tract
- Returns:
A DataFrame of last_name, geography, and P(r|f,s,g) for Asian, Black, Hispanic, and White. If the first name, last name, or geography cannot be found, the probability is NaN.
- Return type:
pd.DataFrame
Notes
- The data files can be found in:
data/distributions/prob_first_name_given_race.parquet
data/distributions/prob_race_given_last_name.parquet
data/distributions/prob_zcta_given_race_2010.parquet
data/distributions/prob_tract_given_race_2010.parquet
Examples
>>> import pyethnicity
>>> pyethnicity.bifsg(
...     first_name="cangyuan", last_name="li", geography=27106, geo_type="zcta"
... )
>>> pyethnicity.bifsg(
...     first_name=["cangyuan", "mark"],
...     last_name=["li", "luo"],
...     geography=[27106, 11106],
...     geo_type="zcta",
... )
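The BIFSG formula above can be worked through by hand. The following is a minimal sketch in plain Python, not the library's implementation: the function name `bifsg_posterior` and all probability values are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

RACES = ("asian", "black", "hispanic", "white")


def bifsg_posterior(p_race_given_surname, p_first_given_race, p_geo_given_race):
    """Combine three per-race lookups into P(r | f, s, g).

    Hypothetical helper for illustration; each argument is a dict keyed by race.
    """
    # Numerator of the formula for each race: P(r|s) * P(f|r) * P(g|r)
    unnormalized = {
        r: p_race_given_surname[r] * p_first_given_race[r] * p_geo_given_race[r]
        for r in RACES
    }
    # Denominator: the same product summed over all four races
    total = sum(unnormalized.values())
    if total == 0:
        # No race has any support, so the posterior is undefined
        return {r: math.nan for r in RACES}
    return {r: unnormalized[r] / total for r in RACES}


# Made-up inputs, NOT taken from the library's data files
p_r_s = {"asian": 0.80, "black": 0.02, "hispanic": 0.03, "white": 0.15}
p_f_r = {"asian": 0.0004, "black": 0.0001, "hispanic": 0.0001, "white": 0.0002}
p_g_r = {"asian": 0.0010, "black": 0.0020, "hispanic": 0.0010, "white": 0.0030}

posterior = bifsg_posterior(p_r_s, p_f_r, p_g_r)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

Because the denominator is the sum of the four numerators, the resulting probabilities always sum to one whenever at least one race has non-zero support.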
- bisg(last_name, geography, geo_type)
Implements Bayesian Improved Surname Geocoding (BISG), developed by Elliott et al. (2009) https://link.springer.com/article/10.1007/s10742-009-0047-1. Pyethnicity augments the Census surname list with distributions calculated from voter registration data sourced from L2.
\[P(r|s,g) = \frac{P(r|s) \times P(g|r)}{\sum_{r'} P(r'|s) \times P(g|r')}\]
where r is race, s is surname, and g is geography. The sum in the denominator runs over all four races r': Asian, Black, Hispanic, and White.
- Parameters:
last_name (Name) – A string or array-like of strings
geography (Geography) – A scalar or array-like of geographies
geo_type (GeoType) – One of zcta or tract
- Returns:
A DataFrame of last_name, geography, and P(r | s, g) for Asian, Black, Hispanic, and White. If either the last name or geography cannot be found, the probability is NaN.
- Return type:
pd.DataFrame
Notes
- The data files can be found in:
data/distributions/prob_race_given_last_name.parquet
data/distributions/prob_zcta_given_race_2010.parquet
data/distributions/prob_tract_given_race_2010.parquet
Examples
>>> import pyethnicity
>>> pyethnicity.bisg(last_name="li", geography=27106, geo_type="zcta")
>>> pyethnicity.bisg(
...     last_name=["li", "luo"], geography=[27106, 11106], geo_type="zcta"
... )
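The NaN behaviour described under Returns can be illustrated with a small stand-alone sketch. It assumes two tiny in-memory lookup tables in place of the parquet files, and the name normalisation (lower-casing, zero-padding the ZCTA to five digits) is a guess rather than documented behaviour. The helper `bisg_row` and all values are hypothetical.

```python
import math

RACES = ("asian", "black", "hispanic", "white")

# Hypothetical stand-ins for the surname and ZCTA distribution files
P_RACE_GIVEN_SURNAME = {
    "li": {"asian": 0.80, "black": 0.02, "hispanic": 0.03, "white": 0.15},
}
P_ZCTA_GIVEN_RACE = {
    "27106": {"asian": 0.0010, "black": 0.0020, "hispanic": 0.0010, "white": 0.0030},
}


def bisg_row(last_name, zcta):
    """Return P(r | s, g) per race, or NaN for every race if a lookup fails."""
    # Normalisation here is an assumption made for this sketch
    surname_probs = P_RACE_GIVEN_SURNAME.get(last_name.strip().lower())
    geo_probs = P_ZCTA_GIVEN_RACE.get(str(zcta).zfill(5))
    if surname_probs is None or geo_probs is None:
        # Missing surname or geography: no posterior can be computed
        return {r: math.nan for r in RACES}
    weights = {r: surname_probs[r] * geo_probs[r] for r in RACES}
    total = sum(weights.values())
    if total == 0:
        return {r: math.nan for r in RACES}
    return {r: w / total for r, w in weights.items()}


found = bisg_row("Li", 27106)      # both lookups succeed
missing = bisg_row("zzzz", 27106)  # unknown surname, so every value is NaN
```

Downstream code that aggregates these probabilities should decide explicitly how to treat NaN rows, for example by dropping them or by falling back to a name-only model.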
Machine Learning Models
- predict_race(first_name, last_name, geography, geo_type, chunksize=1028)
Predict race from first name, last name, and geography. The output from pyethnicity.predict_race_flg is ensembled with pyethnicity.bisg and pyethnicity.bifsg.
- Parameters:
first_name (Name) – A string or array-like of strings
last_name (Name) – A string or array-like of strings
geography (Geography) – A scalar or array-like of geographies
geo_type (GeoType) – One of zcta or tract
chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028
- Returns:
A DataFrame of first_name, last_name, geography, and P(r|n,g) for Asian, Black, Hispanic, and White. If the geography cannot be found, the probability is NaN.
- Return type:
pd.DataFrame
Notes
- The data files can be found in:
data/models/first_last.onnx
data/distributions/prob_race_given_last_name.parquet
data/distributions/prob_zcta_given_race_2010.parquet
data/distributions/prob_tract_given_race_2010.parquet
data/distributions/prob_first_name_given_race.parquet
Examples
>>> import pyethnicity
>>> pyethnicity.predict_race(
...     first_name="cangyuan", last_name="li", geography=11106, geo_type="zcta"
... )
>>> pyethnicity.predict_race(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"],
...     geography=[11106, 27106], geo_type="zcta"
... )
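The description above does not say how the three outputs are ensembled. One common choice is an unweighted average of the per-race probabilities, and the sketch below assumes that rule purely for illustration; it may differ from what pyethnicity actually does. The function `average_ensemble` and the input values are hypothetical.

```python
RACES = ("asian", "black", "hispanic", "white")


def average_ensemble(predictions):
    """Unweighted mean of several per-race probability dicts (an assumed rule)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return {
        r: sum(p[r] for p in predictions) / len(predictions)
        for r in RACES
    }


# Made-up outputs standing in for predict_race_flg, bisg, and bifsg
flg_like = {"asian": 0.90, "black": 0.02, "hispanic": 0.03, "white": 0.05}
bisg_like = {"asian": 0.80, "black": 0.05, "hispanic": 0.05, "white": 0.10}
bifsg_like = {"asian": 0.85, "black": 0.03, "hispanic": 0.04, "white": 0.08}

ensembled = average_ensemble([flg_like, bisg_like, bifsg_like])
```

Averaging distributions that each sum to one yields another distribution that sums to one, so no renormalisation step is needed under this assumption.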
- predict_race_fl(first_name, last_name, chunksize=1028)
Predict race from first and last name.
- Parameters:
first_name (Name) – A string or array-like of strings
last_name (Name) – A string or array-like of strings
chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028
- Returns:
A DataFrame of first_name, last_name, and P(r|f,s) for Asian, Black, Hispanic, and White.
- Return type:
pd.DataFrame
Notes
- The data files can be found in:
data/models/first_last.onnx
Examples
>>> import pyethnicity
>>> pyethnicity.predict_race_fl(first_name="cangyuan", last_name="li")
>>> pyethnicity.predict_race_fl(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"]
... )
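The chunksize parameter controls how many rows reach the ONNX session per call. A minimal sketch of that kind of batching, using a hypothetical `iter_chunks` helper rather than the library's internal code, might look like this:

```python
def iter_chunks(rows, chunksize=1028):
    """Yield consecutive slices of `rows` with at most `chunksize` items each."""
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


# 2,500 placeholder names split with the documented default of 1028
names = [f"name{i}" for i in range(2500)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(names)]
print(chunk_sizes)  # [1028, 1028, 444]
```

A smaller chunksize lowers peak memory use at the cost of more inference calls; a larger one does the opposite.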
- predict_race_flg(first_name, last_name, geography, geo_type, chunksize=1028)
Predict race from first name, last name, and geography. The output from pyethnicity.predict_race_fl is combined with geography using Naive Bayes:
\[P(r|n,g) = \frac{P(r|n) \times P(g|r)}{\sum_{r'} P(r'|n) \times P(g|r')}\]
where r is race, n is name, and g is geography. The sum in the denominator runs over all four races r': Asian, Black, Hispanic, and White.
- Parameters:
first_name (Name) – A string or array-like of strings
last_name (Name) – A string or array-like of strings
geography (Geography) – A scalar or array-like of geographies
geo_type (GeoType) – One of zcta or tract
chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028
- Returns:
A DataFrame of first_name, last_name, geography, and P(r|n,g) for Asian, Black, Hispanic, and White.
- Return type:
pd.DataFrame
Notes
- The data files can be found in:
data/models/first_last.onnx
data/distributions/prob_race_given_last_name.parquet
data/distributions/prob_zcta_given_race_2010.parquet
data/distributions/prob_tract_given_race_2010.parquet
Examples
>>> import pyethnicity
>>> pyethnicity.predict_race_flg(
...     first_name="cangyuan", last_name="li", geography=11106, geo_type="zcta"
... )
>>> pyethnicity.predict_race_flg(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"],
...     geography=[11106, 27106], geo_type="zcta"
... )
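The Naive Bayes formula for predict_race_flg can be recovered in two steps. The documentation gives only the final expression, so the derivation below is a reconstruction; it assumes that geography is conditionally independent of name given race, which is the usual reading of "Naive Bayes" here. The symbol r' is a summation index over the four races.

```latex
\begin{align*}
P(r \mid n, g)
  &= \frac{P(r \mid n)\, P(g \mid r, n)}
          {\sum_{r'} P(r' \mid n)\, P(g \mid r', n)}
  && \text{(Bayes' rule, conditioning on } n \text{)} \\
  &= \frac{P(r \mid n)\, P(g \mid r)}
          {\sum_{r'} P(r' \mid n)\, P(g \mid r')}
  && \text{(assuming } P(g \mid r, n) = P(g \mid r) \text{)}
\end{align*}
```

The same argument, with the surname s in place of the name n, yields the BISG expression.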