
IECO 400: Data Cleaning Worksheet
Question 1: Keeping Track of Observations
1.A) At what level do you want to run your analysis? For example, is it the individual level,
household level, firm level, activity level, etc?
Individual Level
1.B) What variable identifies the above? E.g. is there an individual id number or equivalent?
year
caseid
statefip
metro
famincome
poverty130
foodstamp
hhtenure
wic
pernum
lineno
wt06
age
sex
race
hispan
marst
citizen
bpl
genhealth
height
weight
bmi
educ
educyrs
schlcoll
empstat
multjobs
earnweek
hourwage
ped
soda
dietsoda
exercise
exfreq
fastfd
enoughfd
1.C) For whatever level you are running your analysis at, are there multiple observations in
the dataset for that level? What is the level of an observation? For example, if you are
interested in a household analysis, and you have a dataset where each observation is an
individual, then you may have multiple individuals per household and therefore households
will show up multiple times, even if individuals only show up once. Typing duplicates report
id_var where id_var is your answer for 1.B can be useful for answering this question.
32990 observations
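The check behind duplicates report can be sketched outside Stata as follows. This is only an illustration in Python on made-up rows (the caseid values and ages are hypothetical, not taken from the dataset): it counts how many observations share each ID value, so an ID is unique at the observation level only if every value appears exactly once.

```python
from collections import Counter

# Hypothetical rows of (caseid, age); the values are invented for illustration.
rows = [
    (1001, 34),
    (1002, 51),
    (1003, 27),
    (1002, 51),  # the same caseid shows up a second time
]

def duplicates_report(ids):
    """Map 'number of copies' to the number of observations whose ID has that many copies."""
    copies_per_id = Counter(ids)
    report = Counter()
    for copies in copies_per_id.values():
        report[copies] += copies
    return dict(report)

report = duplicates_report(caseid for caseid, _ in rows)
is_unique = set(report) == {1}  # True only when no ID value is repeated

print(report)
print("ID is unique:", is_unique)
```

If the ID turns out to be unique, the number of observations equals the number of individuals; if not, the repeated IDs show which level an observation really sits at.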
1.D) Is your data panel or cross-sectional or a mix? For example, Panel means you follow the
same set of households across multiple periods, whereas cross-sectional means you have a
new set of households each period.
Cross-Sectional
Question 2: Main Variables
Create a table or list with your primary variables, that is your main outcome variable, your
main independent or treatment variable, and any control variables.
2.A) For each variable, list its variable name in Stata and also briefly describe it in words
year – survey year
caseid – atuscaseid
statefip – fips state code
metro – metropolitan/central city status
famincome – family income
poverty130 – household income greater or less than 130% of poverty level
foodstamp – household received food stamps in past 30 days
hhtenure – living quarters owned, rented, or occupied without rent
wic – received benefits from the wic program in the last 30 days
pernum – person number (general)
lineno – person line number
wt06 – person weight, 2006 methodology
age
sex
race
hispan – Hispanic origin
marst – marital status
citizen – citizen status
bpl – birthplace
genhealth – general health
height – (in inches)
weight – (in pounds)
bmi – Body Mass Index
educ – highest level of school completed
educyrs – years of education
schlcoll – enrollment in school or college
empstat – labor force status
multjobs – has more than one job
earnweek – weekly earnings
hourwage – hourly earnings
ped – total amount of time during the diary day that the respondent spent in
primary eating and drinking
soda – consumed soft drinks such as cola, root beer, or ginger ale
dietsoda – type of soft drink consumed
exercise – participated in physical activities for fitness and health in last 7 days
exfreq – times participated in physical activities in last 7 days
fastfd – purchased prepared food in last seven days
enoughfd – amount of food eaten in household
2.B) For each variable, indicate whether it is categorical or numerical.
year – numerical
caseid – numerical
statefip – categorical
metro – categorical
famincome – numerical
poverty130 – numerical
foodstamp – categorical
hhtenure – categorical
wic – categorical
pernum – numerical
lineno – numerical
wt06 – numerical
age – numerical
sex – categorical
race – categorical
hispan – categorical
marst – categorical
citizen – categorical
bpl – categorical
genhealth – categorical
height – numerical
weight – numerical
bmi – numerical
educ – categorical
educyrs – categorical
schlcoll – categorical
empstat – categorical
multjobs – categorical
earnweek – numerical
hourwage – numerical
ped – numerical
soda – categorical
dietsoda – categorical
exercise – categorical
exfreq – numerical
fastfd – categorical
enoughfd – categorical
2.C) For categorical variables, indicate whether the categories are kept track of through
strings or numbers. Use the tabulate command for each of your categorical variables, one
command at a time, and paste the output here.
All are strings
2.D) For numerical variables, indicate whether the data is discrete or continuous. Use the
summarize command for all of your numerical variables together and copy and paste the
output here.
BMI continuous
Weekly earnings continuous
Hourwage continuous
continuous
PED
discrete
Exfreq
discrete
Question 3: Outliers and Recoding
3.A) Browse the data for your variables. For each variable, do there seem to be any obvious
outliers or recoded values (e.g. do they use 999 for missing values or something similar)?
Yes. BMI, hourly earnings, weekly hours, height, weight, and weekly earnings had major
outliers and very large coded values. I used the replace command to recode those values to
missing (.) so that they no longer inflate the statistics and the graphs are easier to read.
3.B) Run the bacon command (type search bacon in Stata, then click st0197 in the window
that pops up, then click "click here to install") with your primary variables and be sure to use
the generate option to create a variable that stores which observations are outliers. Are any
outliers detected? Rerun your (1) summarize command with the if option specified, to run it
once with only outliers and once without outliers. Are there big changes in the ranges of any
of the variables?
Example commands (replace var1, var2, var3 with your variable names; you may have more or
fewer)
bacon var1 var2 var3, gen(outlier_tag)
sum var1 var2 var3 if outlier_tag == 1
sum var1 var2 var3 if outlier_tag == 0
Bacon command was not needed for my data set.
3.C) Choose one continuous variable and trim the top and bottom 1% of it. For example, ssc
install winsor2 then winsor2 var1 (replace var1 with your variable name) will generate a new
variable called var1_w in which values below the 1st percentile and above the 99th percentile
are replaced by those percentile values. Summarize the new variable: what is the new mean
value? Is this close to the old value or not?
I trimmed BMI with the winsor command.
Before that, I replaced BMI values over 90 with missing. The mean changed to 27.802, which I
am sure is lower than the old value because 2,584 observations were recoded to missing, so
the mean is now closer to the actual average of the observations than the old mean was. A
mean of 27.8 seems reasonable and is not skewed toward a higher value. For bmi_w, the
trimmed BMI, the mean is 27.7659, so the mean did not change much but decreased
slightly.
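The difference between winsorizing and trimming at the 1st and 99th percentiles can be sketched as follows. This is a rough Python illustration on made-up BMI-like numbers, not the worksheet's data, and it assumes a simple nearest-rank percentile rule, which may differ slightly from the rule Stata uses. Winsorizing keeps every observation but pulls extreme values in to the cutoff, while trimming drops them.

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile (an assumed rule, chosen for simplicity)."""
    rank = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]

def winsorize(values, lower=1, upper=99):
    """Replace values outside the [lower, upper] percentiles with the percentile values."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo), hi) for v in values]

def trim(values, lower=1, upper=99):
    """Drop values outside the [lower, upper] percentiles."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [v for v in values if lo <= v <= hi]

def mean(values):
    return sum(values) / len(values)

# Made-up data: 198 ordinary values between 20 and 34, plus two extreme values.
bmi = [20 + (i % 15) for i in range(198)] + [5, 95]
bmi_w = winsorize(bmi)   # same length as bmi, extremes pulled in
bmi_tr = trim(bmi)       # shorter than bmi, extremes removed

print(len(bmi), round(mean(bmi), 3))
print(len(bmi_w), round(mean(bmi_w), 3))
print(len(bmi_tr), round(mean(bmi_tr), 3))
```

Comparing the three means shows how much of the original mean was driven by the tails.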
Question 4: Misc
4.A) Create a twoway scatter plot of your primary outcome variable (twoway scatter var1
var2) and one of your main numeric treatment or control variables. Do any observations
stand out as far away from the others?
4.B) Do these exercises suggest any data cleaning that should be done prior to running
regressions for your analysis or does the data already seem pretty clean (clean means no
outliers, no need to recode or transform variables, missing values are treated in a suitable
manner)?
Yes. A few data cleaning steps had to be done to make sure that NIU (not in universe) codes
and large numbers that stand for missing values are handled properly in variables such as
bmi, earnweek, and hourwage. Values over 90, 90000, and 999, and other codes that
correspond to these high values, were replaced with missing.
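The recoding step described above can be sketched like this. It is a hedged Python illustration on invented records, with None standing in for Stata's missing value (.). Which cutoff belongs to which variable is an assumption made for the example (bmi over 90, earnweek over 90000, hourwage equal to 999); the codebook for the actual dataset should be checked before applying any rule.

```python
# Hypothetical records; the numbers are invented for illustration.
records = [
    {"bmi": 24.3, "earnweek": 850.0, "hourwage": 21.25},
    {"bmi": 999.0, "earnweek": 99999.0, "hourwage": 999.0},
    {"bmi": 31.0, "earnweek": 1200.0, "hourwage": 30.0},
]

# Assumed rules: values strictly above a cutoff, or equal to a listed code, mean "missing".
missing_above = {"bmi": 90, "earnweek": 90000}
missing_codes = {"hourwage": {999}}

def recode_missing(rows, above, codes):
    """Return copies of the rows with flagged values replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for var, value in row.items():
            if value is None:
                continue
            too_large = var in above and value > above[var]
            is_code = value in codes.get(var, set())
            if too_large or is_code:
                new_row[var] = None
        cleaned.append(new_row)
    return cleaned

def mean_nonmissing(rows, var):
    """Mean of a variable over rows where it is not missing."""
    values = [row[var] for row in rows if row[var] is not None]
    return sum(values) / len(values)

cleaned = recode_missing(records, missing_above, missing_codes)
print(mean_nonmissing(records, "bmi"))   # inflated by the 999 code
print(mean_nonmissing(cleaned, "bmi"))   # computed on valid values only
```

Recoding into a copy rather than overwriting the original rows makes it easy to compare summary statistics before and after cleaning.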
4.C) Do you need any data merged or reshaped? Do you need help doing so?
Not really.
