Bayesian Models
- bifsg(first_name, last_name, geography, geo_type)
Implements Bayesian Improved Firstname Surname Geocoding (BIFSG), developed by Voicu (2018) https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012. Pyethnicity augments the Census surname list and HMDA first name list with distributions calculated from voter registration data sourced from L2. BIFSG is implemented as follows:
\[P(r|f,s,g) = \frac{P(r|s) \times P(f|r) \times P(g|r)}{\sum_{r=1}^4 P(r|s) \times P(f|r) \times P(g|r)}\]
where r is race, f is first name, s is surname, and g is geography. The sum is across all four races: Asian, Black, Hispanic, and White.
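As a rough illustration of the BIFSG formula, the sketch below multiplies the three terms and normalizes the product over the four races. The probabilities are made-up numbers chosen for illustration, not values from pyethnicity's data files, and `bifsg_posterior` is a hypothetical helper rather than part of the library.

```python
import numpy as np

RACES = ["asian", "black", "hispanic", "white"]


def bifsg_posterior(p_r_given_s, p_f_given_r, p_g_given_r):
    """Hypothetical helper: combine the three BIFSG terms and normalize across races."""
    numerator = (
        np.asarray(p_r_given_s, dtype=float)
        * np.asarray(p_f_given_r, dtype=float)
        * np.asarray(p_g_given_r, dtype=float)
    )
    # The denominator is the same product summed over all races.
    return numerator / numerator.sum()


# Made-up inputs, ordered as in RACES (assumptions for illustration only).
p_r_given_s = [0.90, 0.01, 0.01, 0.08]          # P(r|s)
p_f_given_r = [0.002, 0.0001, 0.0001, 0.0002]   # P(f|r)
p_g_given_r = [0.0001, 0.0003, 0.0002, 0.0004]  # P(g|r)

posterior = bifsg_posterior(p_r_given_s, p_f_given_r, p_g_given_r)
print(dict(zip(RACES, posterior.round(4))))
```

Because the denominator is the sum of the numerators, the four resulting probabilities always add up to 1.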
- Parameters:
first_name (Name) – A string or array-like of strings
last_name (Name) – A string or array-like of strings
geography (Geography) – A scalar or array-like of geographies
geo_type (GeoType) – One of zcta or tract
- Returns:
A DataFrame of last_name, geography, and P(r|f,s,g) for Asian, Black, Hispanic, and White. If the first name, last name, or geography cannot be found, the probability is NaN.
- Return type:
pd.DataFrame
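Since unmatched inputs yield NaN probabilities, it can be useful to separate them before further analysis. The following is a minimal sketch on a hand-built stand-in frame; the column names `asian`, `black`, `hispanic`, and `white` are assumptions made for illustration and may differ from the columns the function actually returns.

```python
import numpy as np
import pandas as pd

# Assumed names for the probability columns (not confirmed by the documentation).
RACE_COLS = ["asian", "black", "hispanic", "white"]

# Hypothetical stand-in for a result frame; the values are made up.
result = pd.DataFrame(
    {
        "last_name": ["li", "zzzz"],
        "geography": [27106, 27106],
        "asian": [0.95, np.nan],
        "black": [0.01, np.nan],
        "hispanic": [0.01, np.nan],
        "white": [0.03, np.nan],
    }
)

# A row is unmatched if any of its probability columns is NaN.
unmatched = result[RACE_COLS].isna().any(axis=1)
matched = result.loc[~unmatched]
print(f"{unmatched.sum()} of {len(result)} rows could not be matched")
```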
Notes
- The data files can be found in:
data/distributionsprob_first_name_given_race.parquet
data/distributionsprob_race_given_last_name.parquet
data/distributionsprob_zcta_given_race_2010.parquet
data/distributionsprob_tract_given_race_2010.parquet
Examples
>>> import pyethnicity
>>> pyethnicity.bifsg(
...     first_name="cangyuan", last_name="li", geography=27106, geo_type="zcta"
... )
>>> pyethnicity.bifsg(
...     first_name=["cangyuan", "mark"],
...     last_name=["li", "luo"],
...     geography=[27106, 11106],
...     geo_type="zcta",
... )
- bisg(last_name, geography, geo_type)
Implements Bayesian Improved Surname Geocoding (BISG), developed by Elliott et al. (2009) https://link.springer.com/article/10.1007/s10742-009-0047-1. Pyethnicity augments the Census surname list with distributions calculated from voter registration data sourced from L2. BISG is implemented as follows:
\[P(r|s,g) = \frac{P(r|s) \times P(g|r)}{\sum_{r=1}^4 P(r|s) \times P(g|r)}\]
where r is race, s is surname, and g is geography. The sum is across all four races: Asian, Black, Hispanic, and White.
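To show how the BISG formula applies to several records at once, the sketch below evaluates it row by row on small arrays. The numbers are invented for illustration and `bisg_posterior` is a hypothetical helper, not a pyethnicity function.

```python
import numpy as np


def bisg_posterior(p_r_given_s, p_g_given_r):
    """Hypothetical helper: row-wise BISG, one row per person, one column per race."""
    numerator = np.asarray(p_r_given_s, dtype=float) * np.asarray(
        p_g_given_r, dtype=float
    )
    # Normalize each row by its own sum over the four races.
    return numerator / numerator.sum(axis=1, keepdims=True)


# Made-up values; columns are Asian, Black, Hispanic, White.
p_r_given_s = np.array(
    [
        [0.90, 0.01, 0.01, 0.08],
        [0.05, 0.25, 0.10, 0.60],
    ]
)
p_g_given_r = np.array(
    [
        [0.0001, 0.0003, 0.0002, 0.0004],
        [0.0005, 0.0001, 0.0004, 0.0002],
    ]
)

posterior = bisg_posterior(p_r_given_s, p_g_given_r)
print(posterior.round(4))
```

Normalizing with `keepdims=True` keeps the row sums aligned with the rows of the numerator, so each person's probabilities sum to 1 independently of the others.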
- Parameters:
last_name (Name) – A string or array-like of strings
geography (Geography) – A scalar or array-like of geographies
geo_type (GeoType) – One of zcta or tract
- Returns:
A DataFrame of last_name, geography, and P(r | s, g) for Asian, Black, Hispanic, and White. If either the last name or geography cannot be found, the probability is NaN.
- Return type:
pd.DataFrame
Notes
- The data files can be found in:
data/distributionsprob_race_given_last_name.parquet
data/distributionsprob_zcta_given_race_2010.parquet
data/distributionsprob_tract_given_race_2010.parquet
Examples
>>> import pyethnicity
>>> pyethnicity.bisg(last_name="li", geography=27106, geo_type="zcta")
>>> pyethnicity.bisg(
...     last_name=["li", "luo"], geography=[27106, 11106], geo_type="zcta"
... )