# load in the libraries
# for num_rev and test_data
library(adlgraphs)
# this is for some other basic data transformations
library(dplyr)
# for working with labelled data
adlgraphs
provides three main functions to reduce the
amount of time it takes to perform a few very common data
transformations:
num_rev()
- reverses the values of a vectormake_dicho()
- converts a vector to a dichotomous factormake_binary()
- converts a vector to a binary (0 and 1)
num_rev()
num_rev()
was designed with
forcats::fct_rev()
in mind. However, instead of operating
on factors, num_rev()
operates on numeric vectors. Let’s
take a look at how this operates in practice. To do so, let’s look at
the variable top
from the data set test_data
.
As we can see, survey respondents saw the a positively valanced
statement, “I would feel comfortable buying products from Israel”, with
the values ranging from 1 to 4 where 1 = Strongly agree
and
4 = Strongly disagree
.
str(test_data$top)
#> dbl+lbl [1:250] 1, 2, 2, 3, 2, 4, 2, 2, 2, 4, 4, 2, 4, 3, 4, 2, 3, 4, 2, 4...
#> @ label : chr "An ideal society requires some groups to be on top and others to be on the bottom"
#> @ format.spss : chr "F40.0"
#> @ display_width: int 5
#> @ labels : Named num [1:4] 1 2 3 4
#> ..- attr(*, "names")= chr [1:4] "Strongly agree" "Somewhat agree" "Somewhat disagree" "Strongly disagree"
In survey research, we often have to reverse a variable. This is
primarily done for two reasons. The first is because we want to flip the
valence of the question. One reason this might be the case is because we
wanted to create an index by summing this up with other variables. If
the other statements were negatively valanced, it may be in our best
interest to keep the value labels the same but reverse the valance of
top
so that it can now be interpreted as “I would not feel
comfortable buying products from Israel.” Since the value labels don’t
change, the people who were previously classified as “Strongly agree”
are now going be classified as “Strongly disagree”, etc.
The second reason one might reverse the values, is to change the
direction of the scale. For example, as the values in top
increase, so does one’s level of disagreement. We might want to reverse
the values so that now a 1 = “Strongly disagree” and a 4 = “Strongly
agree”, and thus a higher number means more agreement. Because this
method does not flip the valance of the question, people who were
originally classified as “Strongly agree” are still going to be
classified as “Strongly agree”. Put another way, the question isn’t
changing and therefore people’s responses aren’t changing. The only
thing that changes is the value associated with their response.
num_rev()
was designed with the second goal in mind.
Normally when simply reversing the values by subtracting them, all
underlying metadata and attributes are lost. As a result, we would have
to reverse the values, update the value labels and set the variable
label again. This is really time consuming. The purpose of
num_rev()
is to fix this by automating this process of
reversing a numeric vector while maintaining the variable and value
labels. In addition, this function adds a new attribute called
transformation
that describes the data transformation used
to create this variable.
I’m now going to show this function in action. We can see that when
top
= 1, top_rev
= 4; when top
=
2, top_rev
= 3; top
= 3, top_rev
= 2; and when top
= 4, top_rev
= 1.
new_df <- test_data %>%
# let's make a new variable with the num_rev function
mutate(top_rev = num_rev(top)) %>%
# keep only these two variables
select(top_rev, top)
head(new_df)
#> # A tibble: 6 × 2
#> top_rev top
#> <dbl+lbl> <dbl+lbl>
#> 1 4 [Strongly agree] 1 [Strongly agree]
#> 2 3 [Somewhat agree] 2 [Somewhat agree]
#> 3 3 [Somewhat agree] 2 [Somewhat agree]
#> 4 2 [Somewhat disagree] 3 [Somewhat disagree]
#> 5 3 [Somewhat agree] 2 [Somewhat agree]
#> 6 1 [Strongly disagree] 4 [Strongly disagree]
This function does a lot more than just 5 - top
. Let’s
take a look. Using str()
we can see that both variables
have “label” and “labels” attributes and top_rev
has a new
attribute called “transformation”. The new “transformation” attribute is
automatically added and describes what sort of data transformation the
variable underwent when it was created. This is valuable in case you
forgot how it was created. In addition, both variables have the same
variable label as seen in the “label” attribute. The key difference
between these can be found in the “labels” attribute. In
top_rev
, the labels are reversed. In the original variable,
top
, 1 = “Strongly agree”, 2 = “Somewhat agree”, etc.
However, in the new variable 4 = “Strongly agree”, 3 = “Somewhat agree”,
etc.
str(new_df)
#> tibble [250 × 2] (S3: tbl_df/tbl/data.frame)
#> $ top_rev: dbl+lbl [1:250] 4, 3, 3, 2, 3, 1, 3, 3, 3, 1, 1, 3, 1, 2, 1, 3, 2, 1, ...
#> ..@ labels : Named num [1:4] 1 2 3 4
#> .. ..- attr(*, "names")= chr [1:4] "Strongly disagree" "Somewhat disagree" "Somewhat agree" "Strongly agree"
#> ..@ label : chr "An ideal society requires some groups to be on top and others to be on the bottom"
#> ..@ transformation: 'glue' chr "Reversing 'top' while maintaining correct value labels"
#> $ top : dbl+lbl [1:250] 1, 2, 2, 3, 2, 4, 2, 2, 2, 4, 4, 2, 4, 3, 4, 2, 3, 4, ...
#> ..@ label : chr "An ideal society requires some groups to be on top and others to be on the bottom"
#> ..@ format.spss : chr "F40.0"
#> ..@ display_width: int 5
#> ..@ labels : Named num [1:4] 1 2 3 4
#> .. ..- attr(*, "names")= chr [1:4] "Strongly agree" "Somewhat agree" "Somewhat disagree" "Strongly disagree"
Now, there are two reasons that the different labels is significant. First, it means that when we check the frequencies for both variables we will get the same results. The only difference is the order of the response options.
get_freq_table(new_df, top_rev)
An ideal society requires some groups to be on top and others to be on the bottom | N | Percent |
---|---|---|
Strongly disagree | 65 | 26% |
Somewhat disagree | 75 | 30% |
Somewhat agree | 85 | 34% |
Strongly agree | 25 | 10% |
get_freq_table(new_df, top)
An ideal society requires some groups to be on top and others to be on the bottom | N | Percent |
---|---|---|
Strongly agree | 25 | 10% |
Somewhat agree | 85 | 34% |
Somewhat disagree | 75 | 30% |
Strongly disagree | 65 | 26% |
Second, the means of the variable will be different. For
top
, higher score means more disagreement. However, for
top_rev
, a higher number means more agreement. We can see
the differences below.