Bayesian Models

bifsg(first_name, last_name, geography, geo_type)

Implements Bayesian Improved First Name Surname Geocoding (BIFSG), developed by Voicu (2018) https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012. Pyethnicity augments the Census surname list and the HMDA first name list with distributions calculated from voter registration data sourced from L2. BIFSG is implemented as follows:

\[P(r|f,s,g) = \frac{P(r|s) \times P(f|r) \times P(g|r)}{\sum_{r'=1}^{4} P(r'|s) \times P(f|r') \times P(g|r')}\]

where r is race, f is first name, s is surname, and g is geography. The sum is across all races, i.e. Asian, Black, Hispanic, and White.
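The arithmetic in the formula above can be sketched in plain Python. This is only an illustration of the normalization step, not Pyethnicity's internal code: the function name bifsg_posterior is hypothetical and every probability below is made up for the example.

```python
RACES = ("asian", "black", "hispanic", "white")


def bifsg_posterior(p_race_given_surname, p_first_given_race, p_geo_given_race):
    """Combine the three BIFSG inputs into P(r|f,s,g) for each race."""
    # Numerator of the formula for each race: P(r|s) * P(f|r) * P(g|r)
    unnormalized = {
        race: p_race_given_surname[race]
        * p_first_given_race[race]
        * p_geo_given_race[race]
        for race in RACES
    }
    # Denominator: the same product summed over all four races
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}


# Made-up inputs, NOT values from the Pyethnicity data files
p_r_s = {"asian": 0.90, "black": 0.01, "hispanic": 0.01, "white": 0.08}
p_f_r = {"asian": 0.002, "black": 0.0001, "hispanic": 0.0001, "white": 0.0002}
p_g_r = {"asian": 0.0004, "black": 0.0008, "hispanic": 0.0003, "white": 0.0005}

posterior = bifsg_posterior(p_r_s, p_f_r, p_g_r)
for race in RACES:
    print(f"{race}: {posterior[race]:.4f}")
```

Because the denominator is the sum of the four numerators, the resulting probabilities always sum to 1 regardless of the scale of P(f|r) and P(g|r).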

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of zcta or tract

Returns:

A DataFrame of last_name, geography, and P(r|f,s,g) for Asian, Black, Hispanic, and White. If the first name, last name, or geography cannot be found, the probability is NaN.

Return type:

pd.DataFrame

Notes

The data files can be found in:
  • data/distributions/prob_first_name_given_race.parquet

  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.bifsg(
...     first_name="cangyuan", last_name="li", geography=27106, geo_type="zcta"
... )
>>> pyethnicity.bifsg(
...     first_name=["cangyuan", "mark"],
...     last_name=["li", "luo"],
...     geography=[27106, 11106],
...     geo_type="zcta",
... )

bisg(last_name, geography, geo_type)

Implements Bayesian Improved Surname Geocoding (BISG), developed by Elliott et al. (2009) https://link.springer.com/article/10.1007/s10742-009-0047-1. Pyethnicity augments the Census surname list with distributions calculated from voter registration data sourced from L2. BISG is implemented as follows:

\[P(r|s,g) = \frac{P(r|s) \times P(g|r)}{\sum_{r'=1}^{4} P(r'|s) \times P(g|r')}\]

where r is race, s is surname, and g is geography. The sum is across all races, i.e. Asian, Black, Hispanic, and White.
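To see how the geography term shifts a surname-based prior, the sketch below applies the BISG formula to one surname in two different geographies. It is an illustration under assumptions: bisg_posterior is a hypothetical helper, and the probabilities are invented rather than taken from the Pyethnicity data files.

```python
RACES = ("asian", "black", "hispanic", "white")


def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Return P(r|s,g) per race from P(r|s) and P(g|r), both ordered as RACES."""
    # Numerator for each race: P(r|s) * P(g|r)
    weights = [p_s * p_g for p_s, p_g in zip(p_race_given_surname, p_geo_given_race)]
    # Denominator: sum of the numerators over all races
    total = sum(weights)
    return {race: weight / total for race, weight in zip(RACES, weights)}


# Made-up surname prior P(r|s), ordered as RACES
surname_prior = (0.05, 0.30, 0.05, 0.60)

# Made-up P(g|r) for two hypothetical geographies
geo_a = (0.0001, 0.0010, 0.0001, 0.0002)
geo_b = (0.0001, 0.0001, 0.0001, 0.0010)

post_a = bisg_posterior(surname_prior, geo_a)
post_b = bisg_posterior(surname_prior, geo_b)
print(post_a)
print(post_b)
```

With these inputs the same surname is classified as most likely Black in the first geography and most likely White in the second, which is the kind of update BISG is designed to make.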

Parameters:
  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of zcta or tract

Returns:

A DataFrame of last_name, geography, and P(r|s,g) for Asian, Black, Hispanic, and White. If either the last name or geography cannot be found, the probability is NaN.

Return type:

pd.DataFrame
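The NaN behavior described under Returns can be pictured with the following sketch. It is not Pyethnicity's implementation: the lookup tables, their values, and the function bisg_with_missing are all hypothetical, and the real library works on DataFrames rather than dictionaries.

```python
import math

RACES = ("asian", "black", "hispanic", "white")

# Hypothetical lookup tables with made-up values, ordered as RACES
SURNAME_TABLE = {"li": (0.90, 0.01, 0.01, 0.08)}  # P(r|s)
GEO_TABLE = {27106: (0.0004, 0.0008, 0.0003, 0.0005)}  # P(g|r)


def bisg_with_missing(last_name, geography):
    """Return P(r|s,g) per race, or NaN for every race if a lookup fails."""
    p_r_s = SURNAME_TABLE.get(last_name)
    p_g_r = GEO_TABLE.get(geography)
    if p_r_s is None or p_g_r is None:
        # Either input is unknown, so no posterior can be computed
        return {race: math.nan for race in RACES}
    weights = [p_s * p_g for p_s, p_g in zip(p_r_s, p_g_r)]
    total = sum(weights)
    return {race: weight / total for race, weight in zip(RACES, weights)}


found = bisg_with_missing("li", 27106)
missing = bisg_with_missing("zzzz", 27106)
print(found)
print(missing)
```

Callers who need a complete result may want to filter out or separately handle rows whose probabilities are NaN before aggregating.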

Notes

The data files can be found in:
  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.bisg(last_name="li", geography=27106, geo_type="zcta")
>>> pyethnicity.bisg(
...     last_name=["li", "luo"], geography=[27106, 11106], geo_type="zcta"
... )