Chapter 2 Data sources

With the constant economic development and rapid urban growth of the US, we were interested in finding datasets that could help us identify socioeconomic changes, and the factors that affect them, across states and counties. Ivy came across a comprehensive dataset, the Atlas of Rural and Small-town America, which is compiled by USDA agencies and the ERS research team (dataset link: https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/download-the-data/). It provides a broad range of socioeconomic indicators across the nation, organized into general categories. The files are People.csv, Jobs.csv, County classifications.csv, Income.csv, and Veterans.csv, with a complementary file explaining the variables used in each dataset. The following paragraphs describe what each dataset contains in a little more detail.

The raw data has 3278 rows grouped by state, where each row generally represents a county. One important thing to notice is that the first row of the whole dataset describes the US overall, and the first row of each state group contains the values of each variable for that state as a whole. Excluding these 53 aggregate rows, the data actually describes 3225 rural counties.
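The row layout described above can be handled with a small filtering step. The following is a minimal sketch only: the inline sample, its values, and the column names `State`, `County`, and `Value` are made up for illustration and are not taken from the Atlas files. It assumes, as stated above, that each aggregate row is the first row of its state group.

```python
import csv
import io
from itertools import groupby

# Toy stand-in for one Atlas file. Values and column names are hypothetical.
SAMPLE = """State,County,Value
US,United States,100
AL,Alabama,40
AL,Autauga,10
AL,Baldwin,30
AK,Alaska,60
AK,Anchorage,60
"""


def county_rows(rows):
    """Drop the first row of every consecutive State group (the aggregate row)."""
    kept = []
    for _, group in groupby(rows, key=lambda r: r["State"]):
        next(group, None)   # skip the aggregate row that opens the group
        kept.extend(group)  # keep the remaining county rows
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counties = county_rows(rows)
print(len(rows), "rows in total,", len(counties), "county rows kept")
```

Because the national row forms a group of its own, the same rule removes it together with the state-level rows. If the real files carry an explicit marker for aggregate rows, filtering on that marker would be more robust than relying on row order.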

People.csv contains demographic data from the American Community Survey, including information about the rate of total population change, migration, international migration, immigration, education, population density, age composition, race and ethnicity composition, and family composition across all states and counties. The variables are numerical, except for the State and County columns.

Jobs.csv contains economic data from the Bureau of Labor Statistics and other sources, including information about the unemployment rate, labor force, and occupational composition. Most entries are counts or percentages, so the columns are numerical except for the State and County columns.

Income.csv contains economic data from the U.S. Census Bureau’s Small Area Income and Poverty Estimates (SAIPE), including information about the poverty rate for all ages, the poverty rate for ages under 18, deep poverty, median household income, and per capita income. Again, most entries are counts or percentages, so the columns are numerical except for the State and County columns.

Veterans.csv contains indicators from the American Community Survey on the social and economic conditions of veterans. However, we decided not to include this dataset in our analyses, since we would like to focus on more general socioeconomic variables for this project, such as income levels and unemployment rates.

County classifications.csv contains several categorical variables, such as the rural-urban continuum, population loss, manufacturing dependence, economic dependence, and other ERS county codes. Some of these variables are binary, where 1 means the county is in that category and 0 means it is not. Others are multi-level categorical variables that show which level a county falls into for that variable.
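Since the two kinds of classification variables call for different treatment, it can help to separate them by inspecting the distinct values in each column. The sketch below is illustrative only: the rows and the column names `PopLossFlag` and `ContinuumCode` are hypothetical, not the actual Atlas variable names.

```python
# Hypothetical rows. Column names and codes are illustrative only.
rows = [
    {"County": "A", "PopLossFlag": 1, "ContinuumCode": 2},
    {"County": "B", "PopLossFlag": 0, "ContinuumCode": 7},
    {"County": "C", "PopLossFlag": 0, "ContinuumCode": 9},
]


def split_by_kind(rows, id_columns=("County",)):
    """Return (binary, multilevel) column names based on observed values."""
    binary, multilevel = [], []
    for column in rows[0]:
        if column in id_columns:
            continue
        levels = {row[column] for row in rows}
        # A column whose values all lie in {0, 1} is treated as a binary flag.
        (binary if levels <= {0, 1} else multilevel).append(column)
    return binary, multilevel


binary, multilevel = split_by_kind(rows)
print("binary:", binary)
print("multi-level:", multilevel)
```

This check looks only at observed values, so a multi-level variable that happens to take just the codes 0 and 1 in the data would be reported as binary. The variable documentation file remains the authoritative source for each column's meaning.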

2.1 Possible issues

These datasets cover a large number of social and economic variables for 3225 rural counties; in addition, a considerable number of variables only have data for a single year. Thus, it could be difficult for our team to accurately capture all the information and patterns. Moreover, some of the data dates from before 2005, which may make it less valuable to examine. Hence, we might select a subset of variables that are more relevant and comprehensive to work with. For example, we may use the average over the more recent years of some variables to analyze dependency and spatial relationships, and use data spanning a range of years for time series analysis.
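The idea of averaging the more recent years of a variable can be sketched as follows. This is a rough illustration under stated assumptions: it assumes columns are named as a variable name followed by a four-digit year (for example `UnempRate2019`), which is a hypothetical convention, and the record and its numbers are invented. The `since` and `last_n` parameters are likewise illustrative choices, not settled analysis decisions.

```python
import re
from statistics import mean

# Hypothetical record for one county. Names and numbers are illustrative only.
county = {
    "County": "Example",
    "UnempRate2003": 6.1,
    "UnempRate2018": 4.0,
    "UnempRate2019": 3.6,
    "UnempRate2020": 8.0,
}

# Assumed naming convention: <variable name><four-digit year>.
YEAR_SUFFIX = re.compile(r"^(?P<name>[A-Za-z_]+?)(?P<year>(?:19|20)\d{2})$")


def yearly_series(record, variable):
    """Return {year: value} for the columns named <variable><year>."""
    series = {}
    for column, value in record.items():
        match = YEAR_SUFFIX.match(column)
        if match and match.group("name") == variable:
            series[int(match.group("year"))] = value
    return series


def recent_average(record, variable, since=2005, last_n=3):
    """Average the last_n most recent yearly values from `since` onward."""
    series = {y: v for y, v in yearly_series(record, variable).items() if y >= since}
    recent_years = sorted(series)[-last_n:]
    return mean(series[y] for y in recent_years)


print(yearly_series(county, "UnempRate"))
print(recent_average(county, "UnempRate"))
```

The full `yearly_series` output is the kind of input a time series analysis would use, while `recent_average` collapses it into a single value per county for the dependency and spatial analyses.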