congress_districts

PREDICTING 2024 CONGRESSIONAL DISTRICT OUTCOMES USING POLLING DATA AND DIMENSIONALITY REDUCTION ON AMERICAN COMMUNITY SURVEY DATA

Utilizing singular value decomposition (SVD) with an automated selection criterion, which captures 97.5% of the variation in the data, American Community Survey (ACS) 5-Year Estimates that were originally comprised of 7,378 separate features were reduced to a mere 16 components. On fitting a baseline model with hyperparameter tuning but that omits the relevant pre-election polling data, these 16 components yield a training set overall accuracy of 96.7%, a validation set overall accuracy of 84.1%, and a testing set overall accuracy of 87.4%. Given the shifting congressional district lines between the naively chosen training and validation set year (i.e., 2020) and testing set year (i.e., 2022), coupled with the fact that the latter was also a midterm election cycle where the composition of the vote typically differs from the general election cycle in both number and demographically, these early results display just how powerful a technique like SVD can be by itself for distilling from a large amount of data those salient features that are encoded within it.

Included in this project are three main component files: data_prep.py, development.py, and pipeline.py

The script data_prep.py retrieves the ACS variables using the Census API, prepares the datasets, performs the SVD, and determines the top feature contributions in the calculated components. It requires a registered Census API key (see: https://api.census.gov/data/key_signup.html).

The script development.py is for local development and allows the user to run the XGBoost model and metrics without the need for an AWS account.

The script pipeline.py is the main driver of a SageMaker Pipelines machine learning operations (MLOps) environment and requires an AWS account.

Note: This is an ongoing, active project. Taking a cue from other popular polling aggregate outcome predictors, it is intended as a proof of concept to show the development and deployment of an XGBoost classification model using the Census API, Amazon Code Editor (Visual Studio Code-OSS), and Amazon Web Services (AWS) S3 buckets, EC2 instances, and SageMaker Pipelines. Stay tuned.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
data		data
images		images
steps		steps
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
development.py		development.py
pipeline.py		pipeline.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

congress_districts

About

Releases

Packages

Languages

License

nathan-lindstedt/congress_districts

Folders and files

Latest commit

History

Repository files navigation

congress_districts

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages