Chapter 3 Data transformation

Our raw data contains 5 .csv files each describing one aspect of the socioeconomic status for all counties in the United States.

Since the raw data are already stored in .csv format with proper column names, they can be directly loaded into R data frames.

The first step was to import all the data except for the veterans.csv. This is because general socioeconomic attributes such as unemployment rate and per capita income are more relevant to the questions we are trying to answer, so we will not be focusing on veteran data.

After the importing step, we selected a subset of variables from the People and County Classification dataset, and kept all the variables for the Jobs and Income dataset. Specifically, the variables selected from the People dataset are the rates of different education levels, and the variables selected from the County Classification dataset describe poverty, natural amenity and low education classifications. These attributes are highly related to our topic, and all other not so applicable variables are discarded.

We noticed that the columns in the data frames already have proper data type; hence we didn’t make modifications on that. However we plan to convert the FIPs variable into a factor during future analysis and graphing.

For the last step, we joined the cleaned data into one data frame using the FIPs since FIPs is like an ID and is unique for each county. This new data frame was exported to a csv file under the folder data/cleaned; it will be used for the analyses in the rest of this report.