Class September 7.
Instructor's Home Page:  http://crab.rutgers.edu/~goertzel

"Science"  -  the goal of science is to establish empirical theories that
describe and explain reality.  The physical sciences have been very
successful with this goal, the social sciences much less so.  Aristotle
described three kinds of knowledge: episteme, techne and phronesis.  In
this course, I emphasize the techniques - techne.  Next semester, we
deal with the
ethics in CJ.  The textbook talks about science, but the workbook is
techniques.

Artistic fields seek beauty:  poetry, painting, music.  Social science
does not tend to have much literary or artistic merit...

The building blocks are concepts.  They are used to describe the world,
then to explain it.  We have to define our concepts explicitly, and we
have to measure them with "operational definitions."  We try to be
explicit about what we are saying.  Concepts like "mother" or "race" are
ambiguous, so we have to state what meaning we are giving to them.  In
empirical research the bottom line is measurement: how do we know?
Concepts are good if they are useful for our purposes.

Concepts should have clear boundaries and be parsimonious.  Virtropy was
an example of a poor concept; the things that made it up really did not
fit together in any logical or empirical way.

September 10:

Continue with Chapter One -

Theory and a hypothesis.  Theories are more abstract, they are general
statements.  Hypotheses are testable.  Empirical statements.
Falsifiable.  Something that we can test with our senses, with
observation.  Tautology, a statement that is logically true based on the
wording.

Our statements tend to be probabilistic, not absolute.  "Smoking causes
cancer."  We have a degree of confidence in what we say, and we try to
measure that level of confidence.

Where do we get our theories?  From observations or from other people
or writings by other people.

Induction - from specific observations to generalizations.

Deduction - go from the generalization to the hypotheses we wish to
observe.

What do we study?
"pure vs applied" research.
"advocacy research"  -  trying to prove a point.
"evaluation research"  done to establish whether a program works or
not.  Much applied research is in this category.  Applied research may
also deal with "needs assessment."

Sept 14.  "Broken Windows" shows that there is a continuing debate
about the causes of crime rates; there is no agreement.  We need to
look at data to answer these questions, particularly where we can
compare cases with different characteristics, e.g., cities with more or
fewer officers.  This is a "natural experiment."

Look at the exploratory exercise in the workbook.   Open the USA data
file:  is this an "aggregate" or "survey" data file?    It is aggregate.  If the
units have a lot of variation, the "average" figure for a unit, such as a
state, may be misleading.

The GSS data set is a survey data set, with 2832 people.  This is a sample;
the size was a decision made by the researchers.  To get the "margin of
error" we can use the formula M = 1/sqrt(N).  This gives a margin of error
of 1.9% for this sample.
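
As a quick check on that figure, here is a minimal Python sketch of the rule-of-thumb formula M = 1/sqrt(N).  The function name is made up for illustration; only the formula and the sample size of 2832 come from the notes.

```python
import math

def quick_margin_of_error(n):
    """Rule-of-thumb margin of error, as a proportion: M = 1/sqrt(N)."""
    return 1 / math.sqrt(n)

# GSS sample size from the notes
gss_margin = quick_margin_of_error(2832)
print(f"{gss_margin:.4f} as a proportion, or {gss_margin * 100:.1f}%")
```

Multiplying the proportion by 100 gives the percentage form used in the notes.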

September 17:

Examples from the section on The Research with Aggregate Data, pages
47 to 53 in the workbook.  The use of a scattergram, a graph which
plots two continuous variables.  An example of plotting a
scattergram of height and weight was done on the blackboard.  The regression
line is the straight line that best fits the points.  The correlation
coefficient is a measure of how closely the points fit a straight line, or
of how well one variable can be predicted by another.  If the correlation
is perfect, the two variables predict each other perfectly.  A perfect
correlation may be +1 or -1, depending on whether the relationship is
"positive" or "negative" (i.e., inverse).  If you square the r it tells you the
percentage of the variance explained using one variable to explain or
predict the other.  An excel file of the relationship between height and
weight is available at:
http://crab.rutgers.edu/~goertzel/heightweight.xls.  The file also
includes the regression formulas. I put the chart and the data in two
separate "sheets".  This is material we will be covering after the first
midterm, so this is just intended to give you an overview.
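
For those who want to see the arithmetic behind r and the regression line, here is a rough Python sketch.  The heights and weights are hypothetical numbers made up for illustration (they are not the data in the Excel file), and the function names are my own.

```python
import math

# Hypothetical heights (inches) and weights (pounds), made up for illustration.
heights = [60, 62, 64, 66, 68, 70, 72]
weights = [115, 120, 130, 140, 150, 165, 175]

def mean(values):
    return sum(values) / len(values)

def correlation_and_line(xs, ys):
    """Return (r, intercept, slope) for the least-squares line y = a + b*x."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, intercept, slope

r, intercept, slope = correlation_and_line(heights, weights)
r_squared = r ** 2  # share of the variance in weight explained by height
print(f"r = {r:.3f}, r squared = {r_squared:.3f}")
print(f"weight = {intercept:.1f} + {slope:.2f} * height")
```

With these made-up numbers the points lie close to a straight line, so r comes out close to +1.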

September 19:

Papers returned.  They were graded rather rigorously using the
following formulas.  The ones submitted online have been regraded to
the same scale.

Grading scale:
printout  5 pts.
1a        9 pts.(3 for description, 2 for each category)
1b        7 pts.(3 for description, 2 for each category)
2a        8 pts.(2 for each variable)
2b        8 pts.(2 for each variable)
2c        4 pts.(2 for each variable)
3a        8 pts.(2 for each line, 2 for how many cases were missing)
3b        10pts.(2 for each line, 2 for how many cases were missing)
4a        3 pts.
4b        2 pts.
4c        6 pts.(2 for each line)
4d        6 pts.(2 for each line)
5a        6 pts.(1 for each state)
5b        6 pts.(1 for each state)
6a        2 pts.
6b        6 pts.(2 for each line)
7a        2 pts.
7b        2 pts.
this should add up to 100 pts.

Chapter Six on Research Designs.

The logic with which a research project is organized.  Our goal is
usually to say something about the relationship between variables.
Variables are characteristics that take different values, e.g., height,
weight, number of hunters, party affiliation, etc.  The variables are what
we study.

In most cases there is one variable we want to understand, we call that
the dependent variable.  This is a decision, it is not inherent in the
variable itself.  For example, we could take weight as our DV.  Then we
look for the Independent Variables that "cause" it, or at least are
correlated with it.

Most rigorous way to explain causal relationships is the Experiment.  In
an experiment you manipulate one Independent Variable.  You control
for all other independent variables or "control variables."  You observe
the Dependent Variable.  You take your "subjects" and sort them into
two identical groups, preferably through random assignment.  There are
ethical limits: either it has to be voluntary with "informed consent" or you
have to prove why it isn't.  If you don't know whether something works for an
illness, it is ethical to experiment with it.  "External validity" - do the
findings apply to other circumstances?  "Internal validity" - was the
experiment carried out correctly?
There are practical limits on experimentation.  Experiments are
deductive: you have a theory and you test it.

Survey -  ask a sample of people questions.  Best used for descriptive
purposes rather than for testing causal hypotheses.  Problem in a survey
is you don't have a before and an after, unless you do a longitudinal
survey.  You don't manipulate the variables, so you don't know which is
cause and which effect.  Statistical analysis is used to "control" for
extraneous variables.  Not as logically rigorous, but it deals with real
world situations and not artificial ones.  Usually we take people's word
for things.
       Structured questionnaires

Unstructured Interviews
      including Group Interviews - "focus group"

Field research - go into a natural setting and observe what happens.
Anthropologists more often.  More inductive, you observe whatever
happens and try to figure out why.

Aggregate or comparative research.  Take data about geographic or other
groups, often data gathered by government agencies.  Very widely used
in CJ because the system generates a lot of data.  Cross-sectional
comparisons, at one point in time.  Comparing states.  Trend analysis or
time series analysis, looking at how things change over time.

Sept 21

Item 2a1 in the homework.  Comparing two maps, we can see that they
don't have much relationship to each other, so we checked "neither."

Similar -   r would be close to +1
Opposite   -  r would be close to -1
Neither - r is close to zero.

In this example, r = .079, which is closer to zero than to + or - 1.

Two different measures:
The "strength" of the relationship: "r" is a measure of that.  The further
from zero, the stronger.
The "significance" of the relationship: "p" is a measure of that.  The
closer to zero the better, because this is a measure of the likelihood that
the result is due to random error.  This is also indicated with asterisks.
Two asterisks ** means p < .01, * means p < .05.  "p" means Probability or "Prob."
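
The asterisk convention can be written out as a small rule.  This is only a sketch in Python; the function name and the example p values are made up for illustration, and the cutoffs are the ones given above.

```python
def significance_stars(p):
    """Return the asterisks used to flag a p value: ** for p < .01, * for p < .05."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""  # not significant at the .05 level

# Hypothetical p values, chosen only to show each case
for p in (0.005, 0.03, 0.20):
    print(p, significance_stars(p))
```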

"Operationalizing a concept"  -  that means finding a way to measure it,
the answer to the question "how do you know" -

The dependent variable, what we are trying to explain, is the rape rate.  Our
IV is hunting.

Relationship between the graphs and the summary statistics.  Summary
statistics allow us to summarize a lot of data easily, but we lose a lot of
detail that we can get in a graph.  With the r, if the coefficient is
negative, the regression line of the scatter plot will go down from left to right.

Sept 23:  ABC News Video on Junk Science, related readings on the WEB site.
 

September 25:

Overview of Assignment 2b, primarily the use of crosstabs to compute percentages.

Looking at the frequencies for GOVMED we see that there are 1849 respondents and 896 thought the government should help.  To make that a percent we divide 896 by 1849, getting .4846.  Move the decimal two places to the right and we get 48.46%.  Percents always add to 100%; you don't know what they mean if you don't know how they add to 100%.  The base of the percent is the 100%; in this case 1849 people are 100% of the respondents.  The phrase "of the" tells you the base of the percent.

Row, column and total percentages.  A percentage is a ratio between a frequency and a base.  In a sentence, the base follows the word "of."  For example, if I say, what percent of the voters voted for Bush, the base is the number of voters.  This is the denominator in the calculation.  The numerator is the number voting for Bush.

For example, assume that:

75 men voted for Bush
95 women voted for Bush
35 men voted for Gore
125 women voted for Gore.

We could put this into a Table:
 
 
          Men            Women           Total
Bush      75  (56.65)    95  (113.3)     170
Gore      35  (53.35)    125 (106.7)     160
Total     110            220             330

The observed frequencies are listed first.
The expected frequencies are in parentheses.
The chi-square statistic tells if the difference between the observed and expected is greater than we would get by random chance.  If p < .05, we say there is a significant relationship between the variables.

Null hypothesis.  Suppose there is no relationship between gender and voting, what percent of the men would we expect to vote for Bush?  The answer to that is, the same percent as in the total sample.  The same for the women.  What percent of the sample voted for Bush?  170/330, which is 51.5%.  As a proportion it is .515.  Using this proportion, we can compute the "expected frequencies"; these are the frequencies we would "expect" under the null hypothesis that there is no difference between the genders.  To compute those, we take the PROPORTION voting for Bush and multiply it by the number of men and then the number of women.  .515 * 110 = 56.65 men.  This is not a percent, it is a frequency.  .515 * 220 = 113.3 women.  The proportion voting for Gore is 160/330 = .485, so .485 * 110 = 53.35 men and
.485 * 220 = 106.7 women.
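
Here is a rough Python sketch of the same calculation, carried through to the chi-square statistic.  The variable names are my own.  It uses the exact fractions rather than the rounded proportions .515 and .485, so the expected frequencies differ slightly in the last digit (56.67 rather than 56.65).  The p value line uses a shortcut that is valid only for a 2 by 2 table (one degree of freedom).

```python
import math

# Observed frequencies: rows are Bush, Gore; columns are Men, Women
observed = [[75, 95],
            [35, 125]]

row_totals = [sum(row) for row in observed]        # 170, 160
col_totals = [sum(col) for col in zip(*observed)]  # 110, 220
grand_total = sum(row_totals)                      # 330

# Expected frequency under the null hypothesis: row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Chi-square: sum over all cells of (observed - expected) squared / expected
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

# For one degree of freedom only, p can be computed from the normal curve
p_value = math.erfc(math.sqrt(chi_square / 2))

print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {chi_square:.2f}, p = {p_value:.5f}")
```

Since p comes out well below .05, we would say there is a significant relationship between gender and vote in this made-up example.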

We can answer specific questions:

How many men were there in the sample?  110
How many women were there in the sample?  220
How many respondents are there?  330
How many respondents voted for Bush?  170
How many respondents voted for Gore?  160

What percentage of the men voted for Bush?  (column percent)  75/110 = .682 or 68.2%.

What percentage of the Bush voters were men?  (row percent)  75/170 = 44.1%

What percentage of the voters were men who voted for Bush?  (total percent)
75/330 = 22.7%

What percentage of the women voted for Bush?  (column percent)  95/220 = 43.2%
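
The three kinds of percentages differ only in the base.  A minimal Python sketch, using the frequencies from the table above; the function and variable names are made up for illustration.

```python
def percent(frequency, base):
    """A percentage is the frequency divided by the base, times 100."""
    return 100 * frequency / base

men_for_bush = 75
men_total = 110      # column total
bush_total = 170     # row total
grand_total = 330    # everyone in the sample

column_pct = percent(men_for_bush, men_total)    # percent OF THE men who voted for Bush
row_pct = percent(men_for_bush, bush_total)      # percent OF THE Bush voters who were men
total_pct = percent(men_for_bush, grand_total)   # percent OF THE sample who were men voting for Bush

print(f"column {column_pct:.1f}%, row {row_pct:.1f}%, total {total_pct:.1f}%")
```

The numerator is the same in all three; only the denominator changes.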

September 28:

If anyone wants to install the professional version of Microcase, I have some copies to lend out.

Levels of Measurement:  nominal, ordinal, interval and ratio.  This is covered well in our text and in "Levels of Measurement and Units of Analysis."  The concept of expected frequencies is explained well in the Chi Square lesson by Amar Patel.  We will demonstrate the use of the WEB Chi Square Calculator, review the key points about chi square, and show how these techniques could be used in the case of Alleged Racial Profiling by the San Diego Police.

October 1:  Computation of descriptive statistics as described in Trochim.  Completion of in-class exercise.

Descriptive statistics are about your sample.  Their goal is to summarize the essential characteristics of the sample.

Inferential statistics are about generalizing from your sample to a larger population.  They are reported in the form "p =" or "p <" some value.

Two ways of presenting quantitative data:

graphics

summary or descriptive statistics.

The most basic is the average, a measure of "central tendency."
Most common is the mean, which is computed by adding all the values up and dividing by N.  It can be distorted by extreme cases.  The mean requires interval data.  Good if the data are approximately "normally" distributed.

Median, which is the case in the middle.  Requires only ordinal data.

Mode, most frequent case, nominal data.

A second concept is dispersion.  How much variation is there, how spread out things are.  The range is an ordinal measure, how far from the lowest to the highest.  Interquartile range, the distance from the 25% points to the 75% points.  The standard deviation is an interval measure of deviation.  The variance is the S.D. squared.

The distribution, putting the cases in order, either in a table or a graph.   We want a linear scale, or even categories.

Frequency distributions are used.

For the example done in class, the histogram and frequency distribution were as follows:

80,000    #         9.1%
60,000    ##       18.2%
40,000    ######   54.5%
20,000              0.0%
10,000              0.0%
 9,000    ##       18.2%
 

The mean and standard deviation were computed in class using Excel as follows:

    x       mean           x - mean        (x - mean) squared
80000    41636.36364     38363.63636     1471768595
60000    41636.36364     18363.63636     337223140.5
60000    41636.36364     18363.63636     337223140.5
40000    41636.36364     -1636.363636    2677685.95
40000    41636.36364     -1636.363636    2677685.95
40000    41636.36364     -1636.363636    2677685.95
40000    41636.36364     -1636.363636    2677685.95
40000    41636.36364     -1636.363636    2677685.95
40000    41636.36364     -1636.363636    2677685.95
 9000    41636.36364     -32636.36364    1065132231
 9000    41636.36364     -32636.36364    1065132231
                                         4292545455    Sum of the squares
                                         390231405     variance (sum of the squares divided by N = 11)
                                         19754.27561   standard deviation
458000 41636.36364 416363.6364  sums
41636.36364    Mean
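
The same figures can be reproduced with Python's standard statistics module.  This is only a sketch.  Note the assumption that matches the Excel sheet above: the variance is divided by N (the "population" versions pvariance and pstdev), not by N-1.

```python
import statistics

# The eleven cases from the in-class example
values = [80000, 60000, 60000] + [40000] * 6 + [9000] * 2

mean = statistics.mean(values)           # sum of the values divided by N
median = statistics.median(values)       # the case in the middle
mode = statistics.mode(values)           # the most frequent value
variance = statistics.pvariance(values)  # sum of squared deviations divided by N
std_dev = statistics.pstdev(values)      # square root of that variance

print(f"mean {mean:.2f}, median {median}, mode {mode}")
print(f"variance {variance:.0f}, standard deviation {std_dev:.2f}")
```

The mean is pulled a little above the median and mode by the one case at 80,000.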

Oct 3:

Reliability and Validity - the quality of measurement.

Reliability - consistency.
     Two tests, see if they correlate.  Two raters.  Split/half.
      "Internal consistency":  part against the whole; we check a number of items against each of the others and against all of the others.  Coefficient alpha - a measure of consistency for questionnaire items.

We can figure out the reliability, and we want reliable measures, but that is not enough.  Validity is whether it is measuring the right thing, what we meant to measure.  This is difficult conceptually.  "Intelligence":  what does it mean?  With an IQ test, we have items that may be reliable, consistent, but do they measure what we really mean?

Face Validity - does it look like it is measuring the right thing?

Predictive or criterion validity.  Pragmatic.  You have to have a criterion, something to measure it against.

Other ways of testing validity are used when we don't have a good criterion.

Convergent validity - Do a number of other measures give you the same result.

Construct validity.  Does the measure work as our theory says it should.  UFO study established that the measure worked better if treated as a measure of "false memory syndrome" than if used as a measure of "experienced anomalous trauma."

Oct 8.

We spent the hour on expected frequencies and standard deviations.  There was not time to deal with regression, so we put that off until the second exam.  Here are the exercises that were worked in class:
 

8. Now, try figuring out some expected frequencies.  What would you expect to be the cell frequencies if there was no difference between Men and Women on the issue, given the marginal frequencies provided?
 
            Men       Women     Total
Agree       9.821     15.179    25
Disagree    45.179    69.821    115
Total       55        85        140

Quickie formula:  the expected frequency is rt * ct / gt (row total times column total, divided by the grand total).
1.  What proportion of the sample agreed?  25/140 = .17857
2.  What proportion of the sample disagreed?  115/140 = .82143
3.  What proportion of the sample answered?  1.00
4.  How many men would we expect to agree?  How many men are there?  55.  What is the likelihood that any man would agree, if men and women don't differ?  .17857.  So 55 * .17857 = 9.821 - this is NOT a percent.  It is an expected frequency.
Or use the quick formula:  55 * 25/140 = 9.821.
5.  How many men would we expect to disagree?  115 * 55/140 = 45.179, or the proportion in the sample disagreeing, .82143 * 55.
6.  How many women would we expect to agree?  25 * 85/140 = 15.179
7.  How many women would we expect to disagree?  115 * 85/140 = 69.821

9.  The following students achieved the following scores on the midterm:   Joe, 85;  Sam, 62;  Jane, 87;  Samantha, 71;  Wendy, 78.
What is the mean score for this group?  Sum(x)/N =  383/5 = 76.6

10.  What is the standard deviation of the scores for this group?

x - mean   =  (x - mean)     (x - mean) squared
85 - 76.6  =    8.4            70.56
62 - 76.6  =  -14.6           213.16
87 - 76.6  =   10.4           108.16
71 - 76.6  =   -5.6            31.36
78 - 76.6  =    1.4             1.96

        Sum of squares = 425.2.  Divide by N-1 (N is 5, so N-1 is 4):  425.2/4 = 106.3, which is the variance.  The standard deviation is the square root of the variance, or 10.3.

This measures the dispersion.  If this were a large, normally distributed sample, about 2/3 of the people would be within one SD of the mean, i.e., between 66.3 and 86.9.  95% would be within two standard deviations of the mean, i.e., between (76.6 - 20.6) and (76.6 + 20.6), or between 56.0 and 97.2.
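
A short Python sketch of this exercise, using the standard statistics module.  Here stdev divides by N-1, as in the hand calculation.  The ranges are centered on the mean of 76.6; the variable names are my own.

```python
import statistics

scores = [85, 62, 87, 71, 78]  # Joe, Sam, Jane, Samantha, Wendy

mean = statistics.mean(scores)        # 383 / 5
sample_sd = statistics.stdev(scores)  # square root of (sum of squares / (N - 1))

# Ranges of one and two standard deviations around the mean
one_sd_range = (mean - sample_sd, mean + sample_sd)
two_sd_range = (mean - 2 * sample_sd, mean + 2 * sample_sd)

print(f"mean {mean:.1f}, SD {sample_sd:.1f}")
print(f"one SD: {one_sd_range[0]:.1f} to {one_sd_range[1]:.1f}")
print(f"two SD: {two_sd_range[0]:.1f} to {two_sd_range[1]:.1f}")
```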
 
 
 
 

11.   Plot a frequency distribution for this group:

Create a linear scale and plot the cases.  The range of the scale should fit the distribution.
 

100

 90
     XX
 80  X

 70  X

 60  X

 50
 

Oct 15

Regression Analysis, calculating the formula for a straight line that most closely fits the points in a scattergram.

The formula for a line:

dependent variable =  Intercept + coefficient * independent variable

                or

Y = a + b X

a and b are "parameters" which fix the nature of the line.  x and y are variables.  Each pair of x and y defines a point on the line.

How do we graph equations for a line?  On a Cartesian plane.  Graphs can also be three or multi dimensional, but more than three dimensions are difficult to graph.

Examples.

Take the equation y = x  (which could also be written y = 0 + 1 * x)

If X = 1, Y = 1
If X = -1, Y = -1
If X = 0, Y = 0

Graphing this we see that it is a line passing through the 0,0 point at a 45 degree angle, going up from left to right.

If Y = 1 + X, the line will be pushed up one point.

If Y = -X, the line will go down from the upper left to the lower right.

If Y = -1 - 2X, the line will be lower and go down more sharply.  You can see these by plotting them on a graph (which I will not attempt to type into the notes, this will be done on the blackboard.)  There is a WEB Site which plots these sample lines.
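
In place of the blackboard graph, here is a small Python sketch that computes a few points on each sample line from Y = a + bX.  The function name is made up, and the last line (intercept -1, slope -2) is my reading of the final example.

```python
def line(a, b, x):
    """Y = a + b * X, where a is the intercept and b is the slope (coefficient)."""
    return a + b * x

# (intercept, slope) for each sample line
examples = {
    "Y = X": (0, 1),
    "Y = 1 + X": (1, 1),
    "Y = -X": (0, -1),
    "Y = -1 - 2X": (-1, -2),  # assumed reading of the last example
}

for label, (a, b) in examples.items():
    points = [(x, line(a, b, x)) for x in (-1, 0, 1)]
    print(label, points)
```

Changing a moves the whole line up or down; changing b changes how steeply it rises or falls.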

SAMPLING:

Census - enumerate everybody - not practical, too expensive or hard to do for a large population.  If you are in an organization, however, it may be easy to find people.
Sample - generalize to a population from a selection or sample.  We can generalize if the sample is "representative."  This can be done through random selection if you have a list to select from.  Simple random sample: everyone has the same chance of appearing in the sample.  Cluster sampling: done for practical reasons where we don't have a list or we find travel costs too high.  If we do households, we usually cluster.  If we use the telephone, we have a list of subscribers.
  Stratified samples: this means that everyone has a known, but not equal, probability of being in the sample.  This is done so we can find out about subgroups in the sample.  In effect, each group is sampled.  Often we stratify by geographic areas because they are known in advance.
   Nonrandom sampling: done for convenience, to get variety.  But you cannot reliably generalize.  SLOPS, whoever calls in or clicks on the Internet.  This may generate anecdotes, gossip.

We can compute the "margin of error."  This means we can compute how much our sample statistic is likely to vary from the population parameter.  This is based on probability theory.  The larger your sample, the more certain your results.  The size of the population doesn't matter.  The size of the sample in practice depends on how much you want to break things down into sub-groups of whatever kind.

Computing the margin of error, Guide to Computing Margins of Error on the WEB site.

For example,
1. In a college class with 85 students, 32 of whom are black, the mean on the midterm was 75. The standard deviation was 6.21. What is the margin of error for this mean? This is a mean score question, so I use the formula
M = 2 * sd / SQRT(N);   M = 2 * 6.21/SQRT(85) = 1.35 points, NOT a percentage.
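
A minimal Python sketch of the mean-score formula, M = 2 * sd / SQRT(N).  The function name is made up for illustration; the numbers are from the example.

```python
import math

def margin_of_error_mean(sd, n):
    """Margin of error for a mean score, in the same units as the score."""
    return 2 * sd / math.sqrt(n)

# Midterm example: standard deviation 6.21, 85 students
midterm_margin = margin_of_error_mean(6.21, 85)
print(f"{midterm_margin:.2f} points")
```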
 

9. A survey of the tri-county area has 356 respondents, of whom 82 are black and 55 Hispanic. What is the margin of error for statistics about the opinion of the Hispanic residents?   This is a percentage question, but I am not given a statistic, a percentage result.  Use Formula one, M = 1/SQRT(N).  What is N?  It is 55, the number of Hispanic respondents.
M = 1/SQRT(55) = .1348.  This formula gives us a proportion, not a percent; as a percent it is 13.5%.  Suppose I said 61% of the Hispanic respondents are voting for McGreevy.  That is the statistic for the sample.  The population parameter might vary by as much as 13.5%.  We could say that our "confidence interval" is between 61 - 13.5 and 61 + 13.5, or between 47.5% and 74.5%.  This means the election among Hispanics is "too close to call."
 Suppose we had 400 Hispanics, the margin of error would be 5%.  For a sample of 1000, M = 1/SQRT(1000), or 1/31.6, or 3.2%.
   Suppose we wanted a 5% margin of error, how large a sample do we need?  400.  Suppose we want a 5% margin of error for each of five electoral districts, how large a sample do we need?  5 * 400, or 2000.
 

Representative or random sample   Chosen at random from either the total population (simple random sample) or from subgroups of the population.(stratified random sample)

In choosing a sample size, all that matters is the amount of error you can tolerate.  The population size is not relevant.

A researcher wants to obtain a margin of error of no more than 2% in a survey of a county with a
population of 3,000,000. How large a sample is needed?    N = 1/(M*M).  M is the margin of error, expressed as a proportion.   M = .02 because it says 2%.  N = 1/(.02*.02)    N = 1/.0004    N = 2500.  Simple random sample.

Suppose we were going to do this for five counties, and we wanted a 2% margin of error for each?  How large a sample would we need?  A 2% margin of error requires 2500, but we need it for each county, so we need 5 * 2500 or 12500.  Stratified random sample, consisting of a simple random sample of each of the subgroups.
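
The sample size formula N = 1/(M*M) can be sketched in Python as follows.  The function name and the groups argument are my own.  The result is rounded up to a whole person; the small round() step is only there to keep floating-point noise from pushing an exact answer like 2500 up to 2501.

```python
import math

def sample_size(margin, groups=1):
    """Sample needed for a given margin of error (as a proportion) in each group."""
    per_group = math.ceil(round(1 / margin ** 2, 6))  # N = 1/(M*M), rounded up
    return per_group * groups

print(sample_size(0.02))     # one county, 2% margin
print(sample_size(0.02, 5))  # five counties, 2% margin for each
print(sample_size(0.05, 5))  # five districts, 5% margin for each
```

Notice that the population size never appears in the calculation.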

3. 59% of the respondents in a survey of a state with seven million Republican voters voted for Bush,
41% for Gore. There were 625 respondents. What is the margin of error for the percent voting for Bush?
 M =  2 * SQRT((p * (1-p))/N).  What is p?  It is the proportion of respondents giving a certain response, the sample statistic.  In this case, p = .59.  What is N?  625.   M =  2 * SQRT((.59 * (1-.59))/625) = .03935 as a proportion, or 3.93% expressed as a percentage.

What does that mean?  We can be "95% sure" that the population parameter (the true value for the population) is within 3.93% of the sample statistic.  One way to express this is as a "confidence interval".
The lower bound of the confidence interval is the sample statistic minus the margin of error, in this case 59 - 3.93 = 55.07%.
The upper bound of the confidence interval is the sample statistic plus the margin of error, in this case 59 + 3.93 = 62.93%.
We are confident that the true figure, the "population parameter", is between 55.07% and 62.93%.

Suppose 47% vote for Bush, 49% for Gore and 4% for Nader, in a sample of 1200.  What is the margin of error for the Nader vote?
p = .04,  What is 1-p?  .96        M =  2 * SQRT((.04 * (1-.04))/1200) = .0113 or 1.13%
What is the margin of error for the Gore vote?  M =  2 * SQRT((.49 * (1-.49))/1200) = .0289 or 2.89%.
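
Here is a rough Python sketch of formula two and the confidence interval, using the examples above.  The function names are made up for illustration.  Small differences in the last digit from hand calculations come from rounding.

```python
import math

def margin_of_error_proportion(p, n):
    """Formula two: M = 2 * SQRT(p * (1 - p) / N), returned as a proportion."""
    return 2 * math.sqrt(p * (1 - p) / n)

def confidence_interval(p, n):
    """Lower and upper bounds: the sample statistic minus and plus the margin of error."""
    m = margin_of_error_proportion(p, n)
    return p - m, p + m

bush_margin = margin_of_error_proportion(0.59, 625)
low, high = confidence_interval(0.59, 625)
print(f"Bush: margin {bush_margin * 100:.2f}%, interval {low * 100:.2f}% to {high * 100:.2f}%")

nader_margin = margin_of_error_proportion(0.04, 1200)
gore_margin = margin_of_error_proportion(0.49, 1200)
print(f"Nader: {nader_margin * 100:.2f}%   Gore: {gore_margin * 100:.2f}%")
```

The margin is largest when p is near .5 and shrinks as p moves toward 0 or 1, which is why the Nader margin is smaller than the Gore margin for the same sample.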
 

October 24

Questions on page 91.

1.  The difference between a census and a sample?  Census enumerates the whole population.  A sample is a portion selected to be representative.

2.  Parameter and a statistic?   The statistic comes from a sample, the "parameter" is the "true" population value.

3.  Confidence level? How certain we can be of the margin of error, it is almost always set at 95%.
     Confidence interval?  The range within which the population parameter is 95% certain to fall.
     Margin of error?  The amount by which we are 95% confident that the sample statistic may differ from the population parameter.

We get the upper bound of the confidence interval by adding the margin of error to the sample statistic.
We get the lower bound of the confidence interval by subtracting the margin of error from the sample statistic.

If the confidence interval is plus or minus 3%,it means that we are 95% sure that the true population parameter is within 3% of the sample statistic.

I would say, the margin of error is 3%.   If the sample statistic was, for example, 55% supporting McGreevy, the confidence interval would be 52% to 58%.

4.  Simple random sample: throw darts at the directory, write the names on slips of paper and put them in a hat, or use a table of random numbers.  Systematic sample: choose every 50th or 100th case.
 

Nonresponse bias:  -  people don't answer.
Selective availability - some people are home more.
Areal bias, - household surveys, neighborhoods are more convenient.

------------
Survey Research, Oct 26

Asking questions.   Open or closed ended.  You want to get their opinion, not a socially appropriate response, so you try to be neutral.  People tend to view it as a test, to seek approval of the interviewer.  People get satisfaction from the chance to express themselves in a non-judgmental atmosphere.

Refer to Trochim for examples of different kinds of questions:  dichotomous, Likert, etc.

Oct 29

There were a number of problems with the Margins of Error homework assignment, so we will go over some of the items.

5. In a survey of 1000 voters, 600 were Democrats, 300 Republicans and 100 Libertarian. 65% of the Republicans favored George Bush in the  primary. What is the margin of error for this percentage?

Which formula do we use?  The fact that it says "65% of the Republicans" tells us it has to be formula two,  M =  2 * SQRT((p * (1-p))/N).  p is the proportion giving a certain answer, in this case .65.
            M =  2 * SQRT((.65 * (.35))/300).  N is 300 because the question says "of the Republicans".
            M = .0551, or 5.51%.  The answer should be in %.

6. A survey is to be conducted of attitudes among white, black and Hispanic respondents in Camden County. The population is 300,000. Of this population, 80% is white, 15% is black and 4% is Hispanic. The researcher wants to achieve a 3% margin of error for the estimates for each of the groups. How large a sample is needed?

      Which formula?  Formula three,  N = 1/(M*M).  The only unknown is M, the margin of error that is required.  We convert this to a proportion:  a 3% margin of error becomes .03.   N = 1/(.03*.03) = 1111.11, which rounds up to 1112.
Since we have three groups, and we need a 3% margin of error for EACH, we need 3 times 1112 = 3336.
    What kind of a sample is this?  A stratified sample, which means in effect three sub-samples.

We examined a number of graphs, which are linked from the home page.  Polar area diagrams invented by Florence Nightingale.  Anscombe's quartet demonstrated how the same regression equation may fit a number of different distributions.

Sample question based on a graph from the BJS:
Which decade had a marked increase in the homicide rate:
  a.  1955-64   b. 1965-74    c.  1975-84   etc.

We examined a

Nov 2:

Taking the example of a survey in which 54.3% said they voted for Clinton.  We know that in the population Clinton got just under 50%, let's say 49.75%.  Could this difference be due to sampling error?  We have to know the sample size, how many were asked the question:  n = 870.
We use the formula for cases where we are given a percentage vote,  M =  2 * SQRT((p * (1-p))/N).

            p = .543   1 - p = .457          M =  3.38%.   The confidence interval lower bound would be 54.3% - 3.38% = 50.92%.  Since 49.75% is below that lower bound, sampling error alone is unlikely to explain the difference.

Nov 7.  Using wages as the IV and Suicide as the DV, we found that the correlation was positive, .705, and p=.000, it was significant.  The regression equation   Suicide  Rate =        10.563              +               .182            * Wages

It is not a linear relationship because we can see significant breaks in the pattern.  We can see that there are two clusters of cases.  This is not uncommon with time series, because a lot of things change together over time.

We developed an Excel file as an example, using data on trends in Gonorrhea rates from 70 to 75, which was a period of rapid growth.  A linear extrapolation showed that they would continue to a much higher rate.  However, in the real world, 1975 was a turning point and the rate went down.
 

--   Nov 9
We did an example of the Trends assignment, giving some possible explanations.  The explanations may vary, but you should have the description of the trends correct.  The results we got were as follows:
 

Select the variable "10) HomicideRate"  and examine the Time Series Graph.
   Q1:   Which years had the highest homicide rates?    What happened in those years that might explain the trend?

There was a peak in 1934, then another long peak in the 1970s and 1980s.  The first seems to have been correlated with the prohibition era, the second with the period of social unrest in the 1970s and 80s related to Vietnam, racial conflict, and so on.  Homicide seemed to increase during periods of economic affluence.

Select the variable "9) Suicide Rate" and examine the graph
   Q2:  What years had the highest suicide rates?  What happened in those years that might explain the trend.
Suicide was highest in 1908-1912, then peaked again in 1930.  It seemed to rise during periods of affluence.  Looks like the gold standard might be involved, when we went off it, it came down.  Tight money policies.
 

Return to the menu and select both the Homicide Rate and the Suicide Rate.    Print this table out and staple it to this report if you answer the questions by hand.  If you are typing the questions for submission to WEBCT, copy the table and paste it into your report.

       Q3:   Do these trends appear to be related?  Is the relationship the same or different in different decades?  What happened in the 1940s?  In the 1980s?

Yes, they appear to be correlated, both reached peaks in 1932, In the 40s they both went down.

Return to the Menu and compute the Scatterplot with the Suicide Rate as the Dependent Variable and the Divorce Rate as the Independent Variable.  .  Select the Regression Line and the Residuals.   Print out the Scatterplot and attach it to this assignment.

        Q4:  What is the correlation coefficient?    -.126         Is it statistically significant?      no       Is it positive or negative?   negative

        Q5:  Fill in the regression equation:      Suicide Rate = 12.616 - .038 * Divorce Rate

         Q6:  Based on your examination of the Scatterplot, would you say that the relationship between the Suicide Rate and Divorce Rate is linear? That is, would you say that a straight line - the regression line - is a good approximation of the pattern?  Would you say that the divorce rate is a causal factor that helps to explain the suicide rate?

No, this is not a linear pattern, something else must be going on.
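
As a rough check on the answers to Q4-Q6, the fitted equation and the correlation coefficient can be worked through in a few lines of Python.  This is only a sketch: the coefficients 12.616 and -.038 and r = -.126 come from the class results, while the function name and the divorce rates plugged in are made up for illustration.

```python
# Coefficients taken from the class results for Q4 and Q5.
INTERCEPT = 12.616
SLOPE = -0.038
R = -0.126


def predicted_suicide_rate(divorce_rate):
    """Predicted suicide rate from the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * divorce_rate


# r squared is the share of the variation in the suicide rate that the
# straight line accounts for.
r_squared = R ** 2

if __name__ == "__main__":
    for divorce_rate in (0, 10, 20):  # illustrative values, not class data
        print(divorce_rate, round(predicted_suicide_rate(divorce_rate), 3))
    print("r squared:", round(r_squared, 4))
```

With r = -.126, r squared is about .016, so the line accounts for less than 2 percent of the variation in the suicide rate.  That fits the answer to Q6: a straight line is a poor summary of this pattern.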

---

We spent the rest of the class doing an example of the Excel assignments.

November 28 - Causal Analysis

Probabilistic cause, not an absolute cause, not a cause that is sufficient or necessary.   "Cigarette smoking causes cancer."  What we mean is, smoking cigarettes increases the likelihood of getting cancer.  How much?
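
One common way to answer "how much?" is the relative risk: the rate of the outcome among people exposed to the cause divided by the rate among people not exposed.  The sketch below assumes a simple two-group comparison; the function name is an assumption and the counts are hypothetical, invented only to show the arithmetic.  They are not real cancer data.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Ratio of the risk among the exposed to the risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical counts, invented for illustration only:
# 30 cases among 1000 smokers, 10 cases among 1000 non-smokers.
rr = relative_risk(30, 1000, 10, 1000)

if __name__ == "__main__":
    print("relative risk:", round(rr, 2))
```

A relative risk above 1 means the outcome is more likely in the exposed group; it does not mean that everyone exposed gets the outcome.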

There are multiple causes for everything.  What we want to find out is how much each thing contributes.  There are also causal linkages, or indirect causes.  A causes B and then B causes C.

Diagramming causal models.  We put the dependent variable at the right.  We draw arrows going into it for each causal variable that affects it directly.  Then we can have arrows that go into the arrows, steps in the causal analysis, as in this sample file:
http://crab.rutgers.edu/~goertzel/homomale.htm

Criteria of Causation - how do we know that something is a cause of something else.

1.  Time Order.  The cause comes before the effect.  Sometimes we sort out the time order theoretically, e.g., we assume that education precedes employment.  Or we can use a research design that involves gathering data at two points in time.  If you don't have measurements at two points in time, this is shaky.

2.  Correlation.  The two variables vary together.  When one is high, the other is high OR when one is low the other is high.  This gets at the degree of causation: the higher the correlation, the stronger the causal relationship.

3.  Non-spuriousness.  We want to know that the correlation is not caused by something else.  We can test this with an experimental design, if feasible.  Or we can use statistical controls, which are not quite as convincing, but that is all you can do in many cases.
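
One simple statistical control is the partial correlation: the correlation between two variables after removing what each shares with a third.  The sketch below is an illustration under stated assumptions, not the method used in the assignments.  The data are made up so that a third variable z drives both x and y, which have no direct link to each other; the function and variable names are hypothetical.

```python
from itertools import product
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """Correlation of x and y, controlling for z (first-order partial)."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up data: every combination of z, e1, e2 taking the values -1 and 1.
# z drives both x and y; e1 and e2 are unrelated "noise" terms.
rows = list(product((-1, 1), repeat=3))
z = [row[0] for row in rows]
x = [2 * row[0] + row[1] for row in rows]
y = [2 * row[0] + row[2] for row in rows]

r_xy = pearson_r(x, y)             # strong correlation between x and y
r_xy_given_z = partial_r(x, y, z)  # essentially zero once z is controlled

if __name__ == "__main__":
    print("r(x, y):", round(r_xy, 3))
    print("r(x, y | z):", round(r_xy_given_z, 3))
```

In this made-up case the correlation between x and y is spurious: it disappears once z is held constant.  With real data the partial correlation rarely drops all the way to zero, so the result is a matter of degree.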

November 30

Causal aspects of variables.  This has to do with our causal model; it is not inherent in the variables.

Dependent Variable - that is what we want to explain.  Often these are opinions or behaviors.

Independent Variable - what we use to explain it.  Often these are traits or physical characteristics, e.g., sex or race, which are almost always independent.

If you study the effect of race on voting, for example, race would be independent and voting dependent.

Antecedent variables, things that come before the independent variable.  This helps us to deal with a causal chain.  The antecedent variable causes the IV, which causes the DV.

Intervening Variables, things that come between the IV and the DV, e.g., race determines ideology, which determines the vote.
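
A causal chain like this can be written down as a set of arrows and traced by program.  The sketch below is just one possible representation, using the race, ideology, vote example from these notes; the function name and the dictionary layout are assumptions made for illustration.

```python
def causal_paths(arrows, start, end, path=None):
    """List every chain of arrows leading from start to end."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for effect in arrows.get(start, []):
        if effect not in path:  # guard against loops in the diagram
            paths.extend(causal_paths(arrows, effect, end, path))
    return paths


# Each key is a cause; the list holds the variables its arrows point to.
arrows = {
    "race": ["ideology"],
    "ideology": ["vote"],
}

# Chains from the independent variable to the dependent variable.
chains = causal_paths(arrows, "race", "vote")

# Whatever sits between the two ends of a chain is an intervening variable.
intervening = [chain[1:-1] for chain in chains]

if __name__ == "__main__":
    print(chains)
    print(intervening)
```

Adding a direct arrow from race to vote would produce a second, shorter chain with no intervening variable, which is how a diagram shows a direct and an indirect effect side by side.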

We demonstrated assignment 5a, and typed the answers in Microsoft Word.  If you want to do this assignment on WEBCT it is better to use Word instead of Netscape Composer because it has a drawing tool.  You can submit the assignment in Word.

December 3 - we went over the results of our survey; the frequencies and two stories are available online.  Copies of a paper called "Myths of Murder and Multiple Regression" were distributed; copies are also in the "Papers" folder on our WEBCT site.  An abstract is as follows:
Multiple regression has consistently failed to provide definitive answers to policy controversies in criminal justice, yet researchers continue to attempt to use regression techniques for this purpose.   This review of multiple regression analyses of trends in homicide rates suggests that the method fails because researchers overfit their models to one data set, then fail to test them with fresh data.   The lack of progress in regression modeling of homicide trends over several decades suggests that the trends may be chaotic.  Studies that disaggregate trends and combine qualitative with quantitative data have been much more successful.

December 5.  We went over the second midterm, working some regression problems; the answers are online.  We reviewed a Web site on path analysis, noting the statement from Everitt and Dunn (1991): "However convincing, respectable and reasonable a path diagram... may appear, any causal inferences extracted are rarely more than a form of statistical fantasy".  We looked at some examples of trend graphs from the paper on Myths of Murder and Multiple Regression that provide more reliable information about causality.
 
 

Dec 7 - an experiment is not just trying something out, it is a specific research design.  Two key characteristics:  before and after measurement, and manipulation of the Independent Variable.  We actually do something to people and observe the consequences.  This differs from observing behavior in a natural setting OR asking people questions.  This is a very rigorous way to test causal relationships.  It is rigorous because we can control for extraneous factors or variables.  To be rigorous, an experiment needs a control group, and the control group has to be the same as the experimental group EXCEPT for the one independent variable.

Problems with experiments:
1.  You can't really experiment on a lot of variables because you can't control them.  It may be unethical to do so.
2.  They are artificial and you don't know if the real world would be the same.
3.  The experiment itself may change things:  testing effects.  Placebo effect.
4.   Practical problems:  people drop out (experimental mortality), history (things go on in the outside world that affect the experiment).  E.g., field experiment on welfare reform.

Internal validity - was it done correctly.
External validity - whether the experiment can be generalized to the outside world.

Evaluation research:  prison, hospital, educational institution.  Anyplace where you are doing things to people anyway.  The key is to assign people at random, and there is often resistance to this.
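
Random assignment itself is mechanically simple.  The sketch below shuffles a list of subjects and splits it into an experimental group and a control group; the subject IDs, the function name, and the fixed seed (used only so the split can be reproduced) are all assumptions for illustration.

```python
import random


def assign_at_random(subjects, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy, so the original list is left alone
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subject IDs S01 through S20.
subjects = [f"S{i:02d}" for i in range(1, 21)]
experimental, control = assign_at_random(subjects, seed=42)

if __name__ == "__main__":
    print("experimental:", sorted(experimental))
    print("control:     ", sorted(control))
```

Because chance alone decides who lands in which group, the two groups should differ only by chance on everything except the treatment, which is what makes the control group comparable.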
 

December 10 -

Content Analysis - "unobtrusive data."  Data created by a bureaucratic system, e.g., police records, or often by the media.  We study television or newspapers either because the media themselves are our interest, or as a way of getting information, e.g., on crime reported in the news.

Similar to survey research, except that you do coding instead of interviewing.  Coding means that you assign numbers to phenomena that you observe.  Counting things.  Each of your variables is coded from the published information.
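
Coding can be pictured as looking each story up in a codebook and then counting the codes.  The sketch below uses a made-up codebook and a handful of made-up stories; none of it comes from the class survey or from any real news source.

```python
from collections import Counter

# Hypothetical codebook: each category of the variable gets a number.
CRIME_TYPE_CODES = {"homicide": 1, "robbery": 2, "burglary": 3, "other": 9}

# Hypothetical stories; each story is one case, like one survey respondent.
stories = [
    {"headline": "Story A", "crime_type": "homicide"},
    {"headline": "Story B", "crime_type": "robbery"},
    {"headline": "Story C", "crime_type": "homicide"},
    {"headline": "Story D", "crime_type": "vandalism"},  # not in the codebook
]


def code_story(story):
    """Return the numeric code for a story, using 'other' when no category fits."""
    return CRIME_TYPE_CODES.get(story["crime_type"], CRIME_TYPE_CODES["other"])


codes = [code_story(story) for story in stories]
frequencies = Counter(codes)  # how many stories received each code

if __name__ == "__main__":
    for code, count in sorted(frequencies.items()):
        print(code, count)
```

In practice the hard part is the judgment a human coder makes when reading each story; the program only records and counts the result.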

Conceptualization.
Measurement.  Reliability and Validity.

Manifest Content - what it's about on the surface
Latent Content - things that we infer about the content, e.g., does the writer sound angry?  Indignation, sexy?

This class is 50 920 301
 
 
 
 
 

Sampling - which content do you look at?

You can go back in history, and your work can be checked up or replicated.

Data analysis is about the same as for survey research; the only difference is that the unit of analysis is the story or TV show or whatever, rather than a person who was interviewed.