Using statistical software to do your homework.

You may use any software you like to do this assignment. Below are instructions for using R and PSPP/SPSS, organized as a list of the methods you will need. For R users, the instructions assume your data set is called "x".

  1. Getting the data:
    1. In PSPP/SPSS: Save the data somewhere you can find them later. Then go to "File → Import delimited text data" (in PSPP) or "File → Open → Data" (in SPSS) and tell the program which file contains your data (this is where you need to remember the location of your data file). A series of dialog boxes will appear to guide you through the process of importing the data. Remember that your data are tab-delimited (not space-delimited) and that variable names are at the top of the file.

      Most problems occur at this stage because of simple mistakes. For example, students often save the data as a "Web page", meaning as HTML. This inserts characters that make the data unsuitable. To avoid this problem, make sure you save the data as text. Another common problem is to fail to tell the program that variable names are at the top of the file. The next most common mistake is to tell the program that the data are space-delimited, rather than tab-delimited. Make sure you have not made any of these mistakes.

    2. In R: type "x=read.table(file='http://math.gcsu.edu/~jhs/2600/sentence-data-sample.txt', header=TRUE)".
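
    As a minimal sketch of what "read.table" expects, the following reads a tiny made-up tab-delimited sample from an inline string instead of a file. The column names "author", "the", and "of" and all of the values are hypothetical; the real course file may be laid out differently. The argument sep="\t" is an addition that tells R explicitly that the columns are tab-delimited.

    ```r
    # Hypothetical tab-delimited sample; the real data file may have different columns.
    sample_text <- "author\tthe\tof\nMelville\t2\t1\nAusten\t0\t1\nMelville\t1\t0\n"

    # header = TRUE: the first line holds the variable names.
    # sep = "\t": the columns are separated by tabs, not spaces.
    x <- read.table(text = sample_text, header = TRUE, sep = "\t")

    print(names(x))  # variable names taken from the first line
    print(nrow(x))   # number of rows (sentences) that were read
    ```

    If the variable names show up as the first row of data instead, the header setting was most likely left out.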
  2. Frequency tables: Use these to make the usual distribution table for a discrete random variable.
    1. In PSPP/SPSS: Follow the dropdown menus: "Analyze → Descriptive Statistics → Frequencies". Choose the variable you want summarized and click "OK".
    2. In R: Assuming your data is in an object called "x", call "table(x$varname)", where "varname" is the name of the column you want to summarize. To see the relative frequencies of these tabulated values, type "table(x$varname)/sum(table(x$varname))".
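
    The two R commands above can be tried on a small example. This is only a sketch: the data frame and its column "the" are made up for illustration. It also shows "prop.table", a built-in R function that should give the same relative frequencies as dividing by the total.

    ```r
    # Hypothetical data set: "the" counts occurrences of the word "the" in each sentence.
    x <- data.frame(the = c(0, 1, 1, 2, 0, 1))

    counts <- table(x$the)        # how many times each distinct value occurs
    rel <- counts / sum(counts)   # relative frequencies, as described in the text
    rel2 <- prop.table(counts)    # built-in alternative to the line above

    print(counts)
    print(rel)
    ```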
  3. Cross-tabulation or "crosstabs": Use this procedure to see how two variables relate to each other.
    1. PSPP/SPSS: Follow the dropdown menus: "Analyze → Descriptive Statistics → Crosstabs". Choose the variables you want to appear in the rows and columns, and click "OK".
    2. In R: Assuming your data is in an object called "x", call "table(x$varname1,x$varname2)", where "varname1" and "varname2" are the names of the columns you want to summarize. To see the relative frequencies of these tabulated values, divide your table by the total number of cases, as in the previous section describing how to make a frequency table.
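
    A cross-tabulation and its relative frequencies might look like the following sketch. The data frame and the column names "the" and "of" are hypothetical stand-ins for whichever two columns you want to compare.

    ```r
    # Hypothetical data set with two 0/1 columns.
    x <- data.frame(
      the = c(0, 1, 1, 0, 1, 0),
      of  = c(1, 1, 0, 0, 1, 1)
    )

    tab <- table(x$the, x$of)   # rows: values of "the"; columns: values of "of"
    rel <- tab / sum(tab)       # divide by the total number of cases

    print(tab)
    print(rel)
    ```

    Each entry of "rel" is the proportion of cases that fall in that combination of row and column values, so the entries add up to 1.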
  4. Descriptive statistics: Finding sample means and variances is mostly an activity for data analysis, rather than probability, but here our sample space fits in a machine as a data set would. Each sample point has the same chance, and that means that each random variable has an expected value equal to the sample mean as reported by statistical software. It also means the variance of each random variable is very nearly equal to the sample variance as reported by statistical software: the software divides by n - 1 rather than n, where n is the number of cases, so multiply the reported variance by (n - 1)/n if you need the exact value. (Why?)

    So to compute any expected values and variances, you can just select the random variable you are interested in and check its descriptive statistics.

    1. PSPP/SPSS: Go to "Analyze → Descriptive Statistics → Descriptives". Choose the variables you want summarized and click "OK".
    2. In R: If you want to see the mean of a column, type "mean(x$varname)", replacing "varname" with the name of your column. If you want to know the variance of a column, type "var(x$varname)".
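
    The sketch below, on a made-up column called "the", computes the mean and variance in R. It also computes the variance directly from the definition, with every row given probability 1/n, for comparison with "var", which divides by n - 1. The names "pop_var" and "s2" are hypothetical and used only for this illustration.

    ```r
    # Hypothetical data set with a single count column.
    x <- data.frame(the = c(0, 1, 1, 2))

    n <- length(x$the)
    m <- mean(x$the)                 # expected value when each row has probability 1/n
    s2 <- var(x$the)                 # sample variance: divides by n - 1
    pop_var <- mean((x$the - m)^2)   # variance from the definition: divides by n

    print(c(mean = m, sample_var = s2, pop_var = pop_var))
    print(all.equal(pop_var, s2 * (n - 1) / n))   # the two differ by the factor (n - 1)/n
    ```

    For a data set with many rows the factor (n - 1)/n is close to 1, so the two variances nearly agree.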
  5. Generating new variables (columns) from old. After creating a new variable, you can answer questions by making frequency tables and cross-tabulations as described above.

    Suppose, for example, you want to know the probability that a randomly selected sentence contains the word "the", given that it was written by Melville and contains "of". You can create a couple of variables, called "melville" and "has.of". "melville" is 1 if the sentence was written by Melville and 0 otherwise. "has.of" is 1 if the sentence has the word "of" in it, and 0 otherwise. Then you can create a third variable called "melville.of" which is the product of "melville" and "has.of". One of four possibilities can happen:
    melville   has.of   melville.of
    1          1        1
    1          0        0
    0          1        0
    0          0        0
    Now the variable "melville.of" is 1 exactly when a sentence was written by Melville and contains the word "of". So you can compute the aforementioned conditional probability by making a cross-tabulation of the variables "the" and "melville.of".

    To create the variable "melville.of", you need to follow three steps:

    1. Make a column that contains a 1 for the sentences written by Melville, and 0 otherwise. In PSPP/SPSS you can do this with "Transform → Recode into different variables". See below for details.
    2. Make a column that contains a 1 for the sentences that contain the word "of", and 0 otherwise. In PSPP/SPSS you can also do this with "Transform → Recode into different variables". See below for details.
    3. Multiply the two previous columns together. In PSPP/SPSS you can do this with "Transform → Compute". See below for details.

    Recoding and transforming variables in R and PSPP/SPSS:

    1. PSPP/SPSS:
      1. Transform → Recode into different variables: To change, say, a variable "X" into a variable "Y" that is 1 if X > 0 and 0 otherwise: After opening the dialog box, select the "old" variable you want to recode, in this case X. Then in the text box for the output variable, type the name of the new variable, in this case Y. Then click the "Old and new values" button. A new dialog box will open. On the left hand side of this box, you can specify the values of X you want to change. On the right hand side, select what values of Y those old values should be mapped to. Each time you finish specifying a rule to recode the data, click "Add". So, in our example of recoding X into Y, you would enter "0" as the "old value" of X on the left hand side and "0" as the "new value" of Y on the right hand side, then click "Add". Then at the bottom left, choose "All other values" for X and enter a "1" for Y at the top right. Then click "Add" again, then "Continue". Then click "OK" and you should see Y appear in the data sheet. (These two rules assume X is never negative, as is the case for a count.)
      2. Transform → Compute: This step is for making new variables via straightforward calculations. So, for example, to make a new variable Z by adding two other variables, say X and Y, together, go to Transform → Compute. In the target variable box, type the name of the new variable, in this case Z. In the box for typing numeric expressions, type whatever Z is defined to be, in this case X+Y. You can use the keypad and list of functions below to help you. When you are done, click "OK" and Z should appear.
    2. In R:
      1. You can create a variable y that is 1 if, say, x$he is greater than 0 and 0 otherwise like this: first type "y=rep(0, length(x$he))" to make a column of 0's. Then to put 1's in the correct entries of y, type "y[which(x$he > 0)] = 1".
      2. To make a new variable Z that is a function of other variables, just type what you would expect to type, for example "Z = X + Y" to compute the sum of X and Y, or "Z = X * Y" to compute the product.
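
    Putting the steps above together for the Melville example, a minimal sketch in R might look like this. The data frame, the column names "author", "the", and "of", and the helper variable "has.the" are all hypothetical; check the names in your own data set. The "has.of" column is built with the rep/which approach described above, while "melville" and "has.the" use a logical comparison converted with "as.numeric", which should give the same kind of 0/1 column.

    ```r
    # Hypothetical data: author name plus counts of "the" and "of" in each sentence.
    x <- data.frame(
      author = c("Melville", "Melville", "Austen", "Melville", "Austen", "Melville"),
      the    = c(1, 0, 2, 3, 0, 1),
      of     = c(1, 1, 1, 0, 0, 2)
    )

    # Step 1: 1 for sentences written by Melville, 0 otherwise.
    melville <- as.numeric(x$author == "Melville")

    # Step 2: 1 for sentences containing "of", 0 otherwise (rep/which approach).
    has.of <- rep(0, length(x$of))
    has.of[which(x$of > 0)] <- 1

    # Step 3: the product is 1 exactly when both conditions hold.
    melville.of <- melville * has.of

    # 1 for sentences containing "the", 0 otherwise.
    has.the <- as.numeric(x$the > 0)

    # Cross-tabulate and read off the conditional probability
    # P(contains "the" | written by Melville and contains "of").
    tab <- table(has.the, melville.of)
    print(tab)
    p <- tab["1", "1"] / sum(tab[, "1"])
    print(p)
    ```

    The conditional probability is the count in the cell where both variables are 1, divided by the total of the column where "melville.of" is 1.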