Machine Learning Models

predict_race(first_name, last_name, geography, geo_type, chunksize=1028)

Predict race from first name, last name, and geography. The output of pyethnicity.predict_race_flg is ensembled with the outputs of pyethnicity.bisg and pyethnicity.bifsg.

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of "zcta" or "tract"

  • chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028

Returns:

A DataFrame of first_name, last_name, geography, and P(r|n,g) for Asian, Black, Hispanic, and White. If the geography cannot be found, the probability is NaN.

Return type:

pd.DataFrame

Notes

The data files can be found in:
  • data/models/first_last.onnx

  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

  • data/distributions/prob_first_name_given_race.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.predict_race(
...     first_name="cangyuan", last_name="li", geography=11106, geo_type="zcta"
... )
>>> pyethnicity.predict_race(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"],
...     geography=[11106, 27106], geo_type="zcta"
... )
predict_race_fl(first_name, last_name, chunksize=1028)

Predict race from first and last name.

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028
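
The effect of chunksize can be pictured as slicing the input rows into consecutive batches before each model call. The following is a minimal sketch of that idea only; iter_chunks is a hypothetical helper written for illustration, not a function exported by pyethnicity, and the library's internal batching may differ.

```python
def iter_chunks(rows, chunksize=1028):
    """Yield consecutive slices of `rows`, each with at most `chunksize` items.

    Hypothetical helper for illustration; not part of pyethnicity.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


# 2500 placeholder names are split into batches of at most 1028 rows each.
names = [f"name_{i}" for i in range(2500)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(names)]
```

Under this picture, a smaller chunksize lowers peak memory per model call at the cost of more calls.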

Returns:

A DataFrame of first_name, last_name, and P(r|f,s), the probability of each race r given first name f and last name s, for Asian, Black, Hispanic, and White.

Return type:

pd.DataFrame

Notes

The data files can be found in:
  • data/models/first_last.onnx

Examples

>>> import pyethnicity
>>> pyethnicity.predict_race_fl(first_name="cangyuan", last_name="li")
>>> pyethnicity.predict_race_fl(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"]
... )
predict_race_flg(first_name, last_name, geography, geo_type, chunksize=1028)

Predict race from first name, last name, and geography. The output from pyethnicity.predict_race_fl is combined with geography using Naive Bayes:

\[P(r|n,g) = \frac{P(r|n) \times P(g|r)}{\sum_{r'=1}^{4} P(r'|n) \times P(g|r')}\]

where r is race, n is name, and g is geography. The sum is across all races, i.e. Asian, Black, Hispanic, and White.
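
To make the update concrete, the formula can be worked through by hand for a single person. The sketch below assumes the name-based probabilities P(r|n) and the geography likelihoods P(g|r) are already available as plain dictionaries; the function name combine_name_and_geography and all numbers are made up for illustration and are not taken from pyethnicity or its data files.

```python
def combine_name_and_geography(p_race_given_name, p_geo_given_race):
    """Return P(r|n,g) for each race using the Naive Bayes update.

    Hypothetical helper for illustration; not part of pyethnicity.
    """
    # Numerator of the formula: P(r|n) * P(g|r) for each race r.
    unnormalized = {
        race: p_race_given_name[race] * p_geo_given_race[race]
        for race in p_race_given_name
    }
    # Denominator: the same product summed across all races.
    total = sum(unnormalized.values())
    if total == 0:
        # Choice made for this sketch only: no information yields NaN.
        return {race: float("nan") for race in unnormalized}
    return {race: value / total for race, value in unnormalized.items()}


# Made-up inputs: P(r|n) from a name model and P(g|r) for one geography.
p_r_given_n = {"asian": 0.70, "black": 0.05, "hispanic": 0.05, "white": 0.20}
p_g_given_r = {"asian": 0.002, "black": 0.001, "hispanic": 0.003, "white": 0.001}

posterior = combine_name_and_geography(p_r_given_n, p_g_given_r)
```

Dividing by the sum over races is what makes the four resulting probabilities add up to 1.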

Parameters:
  • first_name (Name) – A string or array-like of strings

  • last_name (Name) – A string or array-like of strings

  • geography (Geography) – A scalar or array-like of geographies

  • geo_type (GeoType) – One of "zcta" or "tract"

  • chunksize (int, optional) – How many rows are passed to the ONNX session at a time, by default 1028

Returns:

A DataFrame of first_name, last_name, geography, and P(r|n,g) for Asian, Black, Hispanic, and White.

Return type:

pd.DataFrame

Notes

The data files can be found in:
  • data/models/first_last.onnx

  • data/distributions/prob_race_given_last_name.parquet

  • data/distributions/prob_zcta_given_race_2010.parquet

  • data/distributions/prob_tract_given_race_2010.parquet

Examples

>>> import pyethnicity
>>> pyethnicity.predict_race_flg(
...     first_name="cangyuan", last_name="li", geography=11106, geo_type="zcta"
... )
>>> pyethnicity.predict_race_flg(
...     first_name=["cangyuan", "mark"], last_name=["li", "luo"],
...     geography=[11106, 27106], geo_type="zcta"
... )