There should be only 2 possible outcomes, or levels, for that variable. Take a look: since admit (admission to grad school, yes or no) is our outcome variable here, does the table show exactly 2 levels of admit? A conditional density plot will also show us whether our distribution looks binary. We tell R to add a variable called admit2 to mydata, which is a factored version of our original admit.
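A minimal sketch of these steps, assuming mydata holds a 0/1 admit variable and a gre column as in the tutorial data:

table(mydata$admit)   # should show exactly two levels
mydata$admit2 <- factor(mydata$admit, levels = c(0, 1), labels = c("no", "yes"))
cdplot(admit2 ~ gre, data = mydata,
       main = "Conditional density of admission by GRE score")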
Above we ran a basic glm model and looked at main effects. Here, rank is ordered, so we can analyze how the outcome (admission to grad school) changes at each level of rank. The p-value from the Wald test above tells you whether the overall effect of that categorical variable is significant. You can use the update function instead, but that takes some work: it involves making a number of new data sets.
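A sketch of the Wald test and of refitting with a new reference level, assuming a fitted model such as mylogit <- glm(admit2 ~ gre + gpa + rank, data = mydata, family = binomial), where terms 4 to 6 are the rank dummies (an assumption about the term positions):

library(aod)
wald.test(Sigma = vcov(mylogit), b = coef(mylogit), Terms = 4:6)   # overall effect of rank
mydata$rank <- relevel(mydata$rank, ref = "2")   # change the reference level
mylogit2 <- update(mylogit, data = mydata)       # refit against the releveled factor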
This is my annotated version of the UCLA tutorial for beginners, with the addition of graph titles, because untitled graphs are a pet peeve of mine. We also have to constantly remind R that our factors are indeed factors. Interpreting the graph: each line is the predicted probability of being admitted to grad school for each institutional rank. The legend tells us that red represents rank 1 institutions, green rank 2, blue rank 3, and purple rank 4, ordered from highest ranked to lowest ranked.
Each line is set against its color-coded confidence interval. We can see that the intercepts fall in rank order and that, for each institutional rank, the positive slope shows the predicted probability of being admitted to grad school increasing as GRE score increases.
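A sketch of how such a plot can be produced, holding gpa at its mean (a simplifying assumption) and omitting the confidence bands for brevity:

newdat <- expand.grid(gre = seq(200, 800, 10), gpa = mean(mydata$gpa), rank = factor(1:4))
newdat$prob <- predict(mylogit, newdata = newdat, type = "response")
library(ggplot2)
ggplot(newdat, aes(gre, prob, colour = rank)) +
  geom_line() +
  ggtitle("Predicted probability of admission by GRE score and institutional rank")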
We see that, in general across both groups, the older women are, the more likely they are to use contraceptives. However, we also see a Group by Age interaction. The coefficients have an additive effect on the log scale of the response, while the IRRs have a multiplicative effect on the original scale.
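A small illustration with a hypothetical fitted model object named fit: exponentiating the coefficients moves from the additive log scale to multiplicative incidence rate ratios.

exp(coef(fit))      # incidence rate ratios
exp(confint(fit))   # confidence intervals on the IRR scale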
For additional information on the various metrics in which the results can be presented, and their interpretation, please see Regression Models for Categorical Dependent Variables Using Stata, Second Edition, by J. Scott Long and Jeremy Freese. To understand the model better, we can use the margins command.
Below we use the margins command to calculate the predicted counts at each level of prog, holding all other variables in the model (in this example, math) at their mean values.
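The output discussed below comes from Stata's margins command and is not reproduced here; an equivalent calculation in R might look like the following sketch, assuming a fitted model m1 <- glm(y ~ prog + math, family = poisson, data = dat), where the object and variable names are illustrative:

newdat <- data.frame(prog = factor(levels(dat$prog), levels = levels(dat$prog)),
                     math = mean(dat$math))
predict(m1, newdata = newdat, type = "response")   # predicted counts at each level of prog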
In the output above, we see the predicted number of events for level 1 of prog, and that the predicted number of events for level 2 of prog is higher.
Note that the ratio of these predicted counts matches what we saw in the IRR output table. Below we obtain the predicted counts for values of math ranging from 35 to 75.

Interest now centers on whether different regions tend to have different crime rates. While a Poisson regression model is a good first choice because the responses are counts per year, it is important to note that the counts are not directly comparable because they come from schools of different sizes.
This issue is sometimes referred to as the need to account for sampling effort; in other words, we expect schools with more students to have more reports of violent crime, since there are more students who could be affected. We cannot directly compare the 30 violent crimes at the first school in the data set to the absence of violent crimes at the second school when their enrollments are vastly different.
We can take the differences in enrollments into account by including an offset in our model, which we will discuss in the next section. Note that there is a noticeable outlier for a Southeastern school. We therefore combined the SW and SE to form a single category, the South, and we also removed the extreme observation from the data set. In addition, the regional pattern of rates at universities appears to differ from that of the colleges.
Although working with the observed rates per enrolled student is useful during exploratory data analysis, we do not use these rates explicitly in the model. The counts per year are the Poisson responses when modeling, so we must take the enrollment into account in a different way. Our approach is to include a term on the right side of the model called an offset, which is the log of the enrollment in thousands. There is an intuitive heuristic for the form of the offset: modeling the log of the rate (the count divided by enrollment) is equivalent to modeling the log of the count with log(enrollment) added to the linear predictor as a term with a fixed coefficient of one. We also have reason to question the Poisson regression assumption of variability equal to the mean; we will return to this issue after some initial modeling.
The fact that the variance of the violent crime rate tends to be on the same scale as the mean tells us that adjusting for enrollment may provide some help, although it may not completely solve our issues with excessive variance.
We are interested primarily in differences in violent crime between institutional types, controlling for differences between regions, so we fit a model with region, institutional type, and our offset.
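A sketch of this model in R, assuming a data frame crime with columns nv (violent crime count), type, region, and enroll1000 (enrollment in thousands); these names are illustrative rather than taken from the original data set:

modeltr <- glm(nv ~ type + region, family = poisson,
               offset = log(enroll1000), data = crime)
summary(modeltr)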
Note that the Central region is the reference level in our model. A Wald-type confidence interval for a region effect can be constructed by first calculating a CI for the coefficient and then exponentiating the endpoints to obtain a CI for the ratio of rates. Comparisons to regions other than the Central region can be accomplished by changing the reference region, or by using Tukey's honestly significant difference, which compares a standardized mean difference between two groups to a critical value from a studentized range distribution; this method helps control the large number of false positives that we would see if we ran multiple pairwise tests comparing groups.
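A sketch of the Wald-type interval calculation described above (the coefficient name regionNE is illustrative):

est <- coef(summary(modeltr))["regionNE", "Estimate"]
se  <- coef(summary(modeltr))["regionNE", "Std. Error"]
exp(est + c(-1, 1) * 1.96 * se)   # 95% CI for the rate ratio vs. the Central region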
We find that the Northeast has significantly higher rates of violent crimes than the Central, Midwest, and Western regions, while the South has significantly higher rates of violent crimes than the Central and the Midwest, controlling for the type of institution. These results certainly suggest significant differences in regions and type of institution.
However, the EDA findings suggest the effect of the type of institution may vary depending upon the region, so we consider a model with an interaction between region and type. These results provide convincing evidence of an interaction between the effect of region and the type of institution. A drop-in-deviance test like the one we carried out in the previous case study confirms the significance of the contribution of the interaction to this model.
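A sketch of the drop-in-deviance comparison, continuing the illustrative names used above:

modeli <- glm(nv ~ type + region + type:region, family = poisson,
              offset = log(enroll1000), data = crime)
anova(modeltr, modeli, test = "Chisq")   # drop-in-deviance (likelihood ratio) test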
For example, our model estimates how the violent crime rate for universities compares to that for colleges within each region. However, the residual deviance remains large relative to its degrees of freedom, indicating significant lack-of-fit. One possibility is that there are other important covariates that could be used to describe the differences in the violent crime rates. Without additional covariates to consider, we look for extreme observations, but we have already eliminated the most extreme of the observations.
In the absence of other covariates or extreme observations, we consider overdispersion as a possible explanation of the significant lack-of-fit.
Overdispersion suggests that there is more variation in the response than the model implies. Under a Poisson model, we would expect the means and variances of the response to be about the same in various groups. Without adjusting for overdispersion, we use incorrect, artificially small standard errors leading to artificially small p-values for model coefficients.
We may also end up with artificially complex models. We can take overdispersion into account in several different ways. The simplest is to use an estimated dispersion factor to inflate standard errors. Another way is to use a negative-binomial regression model.
We begin by using an estimate of the dispersion parameter; a common choice is the Pearson chi-square statistic divided by its residual degrees of freedom, which will be larger than one in the presence of overdispersion. Our process for model building and comparison is then called quasilikelihood: similar to likelihood, but without exact underlying distributions.
If we choose to use a dispersion parameter with our model, we refer to the approach as quasilikelihood. In the absence of overdispersion, we expect the dispersion parameter estimate to be close to 1; here the estimated dispersion parameter is much larger than 1, and the larger estimated standard errors in the quasi-Poisson model reflect the adjustment. The following sketch illustrates a quasi-Poisson approach to the interaction model.
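A sketch of the quasi-Poisson fit, along with a by-hand Pearson-based dispersion estimate, continuing the illustrative names used above:

modeliq <- glm(nv ~ type + region + type:region, family = quasipoisson,
               offset = log(enroll1000), data = crime)
summary(modeliq)
sum(residuals(modeli, type = "pearson")^2) / df.residual(modeli)   # dispersion estimate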
For example, the standard error for the West region term is considerably larger under the quasi-Poisson model than under the likelihood-based approach, and this term is no longer significant. In fact, after adjusting for overdispersion (extra variation), none of the model coefficients in the quasi-Poisson model are significant at the usual level.
This is because the standard errors were all increased by a common factor of roughly 2 (the square root of the estimated dispersion parameter). Drop-in-deviance tests can be similarly adjusted for overdispersion in the quasi-Poisson model: divide the drop in deviance per degree of freedom by the estimated dispersion parameter, and compare the result to an F-distribution with the difference in model degrees of freedom as the numerator degrees of freedom and the residual degrees of freedom of the larger model as the denominator degrees of freedom.
The output below tests for an interaction between region and type of institution after adjusting for overdispersion (extra variance):
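A sketch of that adjusted test, continuing the illustrative names used above:

modeltrq <- glm(nv ~ type + region, family = quasipoisson,
                offset = log(enroll1000), data = crime)
anova(modeltrq, modeliq, test = "F")   # the F test accounts for the estimated dispersion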
Another approach to dealing with overdispersion is to model the response using a negative binomial instead of a Poisson distribution. You may recall that negative binomial random variables take on non-negative integer values, which is consistent with modeling counts. These results differ from those of the quasi-Poisson model: several effects are now statistically significant.
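A sketch of the negative binomial fit using glm.nb from the MASS package, continuing the illustrative names used above:

library(MASS)
modelinb <- glm.nb(nv ~ type + region + type:region + offset(log(enroll1000)),
                   data = crime)
summary(modelinb)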
In this case, compared to the quasi-Poisson model, negative binomial coefficient estimates are generally in the same direction and similar in size, but negative binomial standard errors are somewhat smaller. In summary, we explored the possibility of differences in the violent crime rate between colleges and universities, controlling for region.
Our initial efforts seemed to suggest that there are indeed differences between colleges and universities, and the pattern of those differences depends upon the region. However, this model exhibited significant lack-of-fit which remained after the removal of an extreme observation. In the absence of additional covariates, we accounted for the lack-of-fit by using a quasilikelihood approach and a negative binomial regression, which provided slightly different conclusions.
Sometimes when analyzing Poisson data, you may see many more zeros in your data set than you would expect for a Poisson random variable. This survey was conducted on a dry campus where no alcohol is officially allowed, even among students of drinking age, so we expect that some portion of the respondents never drink.
The non-drinkers would thus always report zero drinks. However, there will also be students who are drinkers reporting zero drinks because they just did not happen to drink during the past weekend.
Our zeros, then, are a mixture of responses from non-drinkers and drinkers who abstained during the past weekend. The purpose of this survey is to explore factors related to drinking behavior on a dry campus. What proportion of students on this dry campus never drink? What factors, such as off-campus living and sex, are related to whether students drink? Among those who do drink, to what extent is moving off campus associated with the number of drinks in a weekend?
Answering these questions would be a simple matter if we knew who was and was not a drinker in our sample. Unfortunately, the non-drinkers did not identify themselves as such, so we will need to use the data available with a model that allows us to estimate the proportion of drinkers and non-drinkers.
Each line of the weekendDrinks data set corresponds to one survey respondent. We will also consider whether a student is likely a first-year (firstYear) student based on the dorm they live in. Here is a sample of observations from this data set. As always, we take stock of the amount of data; here there are 77 observations, and we proceed with that in mind. A premise of this analysis is that those responding zero drinks are a mixture of non-drinkers and drinkers who abstained the weekend of the survey.
The mean number of drinks reported for the past weekend is about 2. Because our response is a count, it is natural to consider a Poisson regression model. The next step in the EDA is especially helpful if you suspect your data contain excess zeros: compare the Poisson distribution implied by the observed mean to the observed distribution of drinks. Making this comparison, we see far more zeros in the data than the Poisson model would predict.
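A sketch of this check, assuming a data frame drinks.df with a column drinks (names are illustrative):

lambda.hat <- mean(drinks.df$drinks)
mean(drinks.df$drinks == 0)   # observed proportion of zeros
dpois(0, lambda.hat)          # proportion of zeros a Poisson model with this mean would predict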
This circumstance actually arises in many Poisson regression settings, and the model designed to handle it is referred to as a zero-inflated Poisson, or ZIP, model. We first fit a simple Poisson model with the covariates off.campus and sex.
Both covariates are statistically significant, but a goodness-of-fit test based on the residual deviance reveals significant lack-of-fit. In the absence of important missing covariates or extreme observations, this lack-of-fit may be explained by the presence of a group of non-drinkers.
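A sketch of the Poisson fit and the deviance goodness-of-fit test, using the illustrative drinks.df names from above:

pois.m1 <- glm(drinks ~ off.campus + sex, family = poisson, data = drinks.df)
summary(pois.m1)
pchisq(deviance(pois.m1), df.residual(pois.m1), lower.tail = FALSE)   # lack-of-fit p-value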
A zero-inflated Poisson regression model that takes non-drinkers into account consists of two parts: a Poisson part for the counts among drinkers, and a logistic part for the probability of belonging to the non-drinking group (sketched below). There are many ways in which to structure this model; here we use different predictors in the two pieces, although it would have been perfectly fine to use the same predictors for both pieces, or even no predictors for one of the pieces.
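A plausible reconstruction of the two-part form, assuming off.campus and sex appear in the Poisson piece and firstYear in the logistic piece (an assumption consistent with, but not confirmed by, the surrounding text):

log(lambda_i)  = beta0 + beta1 * off.campus_i + beta2 * sex_i    (mean number of drinks among drinkers)
logit(alpha_i) = gamma0 + gamma1 * firstYear_i                   (probability of being a non-drinker)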
How is it possible to fit such a model? The ZIP model is a special case of a more general type of statistical model referred to as a latent variable model.
More specifically, it is a type of mixture model, in which observations from two or more groups occur together but group membership is unknown. Zero-inflated models are a particularly common example of a mixture model, but the response does not need to follow a Poisson distribution. Likelihood methods are at the core of this methodology, but fitting is an iterative process in which it is necessary to start with some guesses, or starting values. Here is the general idea of how ZIP models are fit.
Imagine that the Poisson distribution implied by the fitted mean is subtracted from the observed distribution of drinks; some zero responses will remain, and these excess zeros are attributed to the non-drinkers. The likelihood is used, and some iterating in the fitting process is involved, because the Poisson piece and the proportion of excess zeros must be estimated together. Furthermore, the likelihood incorporates the predictors sex and off.campus. So there is a little more to it than computing the proportion of zeros, but this heuristic should give you a general idea of how these kinds of models are fit.
We will use the R function zeroinfl from the package pscl to fit a ZIP model. Again, we could have used the same covariates for the two pieces of a ZIP model, but we did not do so here. As we have done with previous Poisson regression models, we exponentiate each coefficient for ease of interpretation; exponentiating the coefficient for the first-year term for this model yields a value greater than 3.
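A sketch of the ZIP fit, with the covariate split assumed as above (in the zeroinfl formula, the Poisson piece comes before the vertical bar and the zero-inflation piece after it):

library(pscl)
zip.m <- zeroinfl(drinks ~ off.campus + sex | firstYear, data = drinks.df)
summary(zip.m)
exp(coef(zip.m))   # rate ratios for the count piece, odds ratios for the zero piece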
More on this in Chapter 6. Moving from ordinary Poisson to zero-inflated Poisson has helped us address additional research questions: What proportion of students are non-drinkers, and what factors are associated with whether or not a student is a non-drinker? While a ZIP model seems more faithful to the nature and structure of this data, can we quantitatively show that a zero-inflated Poisson is better than an ordinary Poisson model?
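One common way to make that comparison is the Vuong test for non-nested models, available in the pscl package; a sketch comparing the two fits from above:

vuong(pois.m1, zip.m)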
First, let's take a look at these five assumptions. Assumptions 1 and 2 should be checked first, before moving on to assumptions 3, 4, and 5. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running a Poisson regression might not be valid.
Also, if your data violate Assumption 5, which is extremely common when carrying out Poisson regression, you first need to check whether you have "apparent Poisson overdispersion". Apparent Poisson overdispersion arises when you have not specified the model correctly, such that the data merely appear overdispersed. Therefore, if your Poisson model initially violates the assumption of equidispersion, you should first make a number of adjustments to your Poisson model to check whether it is genuinely overdispersed.
In the Procedure section, we illustrate the SPSS Statistics procedure for performing a Poisson regression, assuming that no assumptions have been violated. First, we introduce the example that is used in this guide. The Director of Research of a small university wants to assess whether the experience of an academic and the time they have available to carry out research influence the number of publications they produce.
Therefore, a random sample of 21 academics from the university are asked to take part in the research: 10 are experienced academics and 11 are recent academics. The number of hours they spent on research in the last 12 months and the number of peer-reviewed publications they generated are recorded. The 13 steps below show you how to analyse your data using Poisson regression in SPSS Statistics when none of the five assumptions in the previous section, Assumptions , have been violated.
At the end of these 13 steps, we show you how to interpret the results from your Poisson regression. However, the procedure is identical. Note: Whilst it is standard to select Poisson loglinear in order to carry out a Poisson regression, you can also choose to run a custom Poisson regression by selecting Custom and then specifying the type of Poisson model you want to run using the Distribution:, Link function: and —Parameter— options.
Note 1: If you have ordinal independent variables, you need to decide whether these are to be treated as categorical and entered into the Factors: box, or treated as continuous and entered into the Covariates: box.
They cannot be entered into a Poisson regression as ordinal variables. Note 2: Whilst it is typical to enter continuous independent variables into the Covariates: box, it is possible to enter ordinal independent variables instead. However, if you choose to do this, your ordinal independent variable will be treated as continuous.
Note 3: If you click on the button the following dialogue box will appear: In the —Category Order for Factors— area you can choose between the Ascending, Descending and Use data order options. These are useful because SPSS Statistics automatically turns your categorical variables into dummy variables. Unless you are familiar with dummy variables, this can make it a little tricky to interpret the output from a Poisson regression for each of the groups of your categorical variables.
Therefore, making changes to the options in the —Category Order for Factors— area can make it easier to interpret your output. Note 1: It is in this dialogue box that you build your Poisson model. In particular, you determine which main effects to include, as well as whether you expect there to be any interactions between your independent variables.
If you suspect that you have interactions between your independent variables, including these in your model is important not only to improve the prediction of your model, but also to avoid issues of overdispersion, as highlighted in the Assumptions section earlier. Note 2: You can also build nested terms into your model by adding these into the Term: box in the —Build Nested Term— area.
We do not have nested effects in this model, but there are many scenarios where you might have nested terms in your model. Note: There are a number of different options you can select within the —Parameter Estimation— area, including the ability to choose a different scale parameter method, among other settings.
There are also a number of specifications you can make in the —Iterations— area in order to deal with issues of non-convergence in your Poisson model. Note 1: You can also choose between the Wald and Likelihood ratio methods, based on factors such as sample size and the implications that this can have for the accuracy of statistical significance testing.