Larry Wasserman's (positive) review of "The Search for Certainty" by Krzysztof Burdzy

Larry sent me this review of a book on the philosophy of statistics that Christian and I reviewed recently, which I'll paste in below. Then I'll offer a few comments of my own.
Larry writes:
After reading the reviews of Kris Burdzy's book "The Search for Certainty" that appeared on the blogs of Andrew Gelman and Christian Robert, I was tempted to dismiss the book without reading it. However, curiosity got the best of me and I ordered the book and read it. I am glad I did. I think this is an interesting and important book.
Both Gelman and Robert were disappointed that Burdzy's criticism of philosophical work on the foundations of probability did not seem to have any bearing on their work as statisticians. But that was precisely the author's point. In practice, statisticians completely ignore (or misrepresent) the philosophical foundations espoused by de Finetti (subjectivism) and von Mises (frequentism). This is itself a damning criticism of the supposed foundational edifice of statistics. Burdzy makes a convincing case that the philosophy of probability is a complete failure.
He criticizes von Mises because his theory, based on defining limits of sequences (or collectives), does not assign a probability to a given event. (There are also technical issues with the mathematical definition of a collective that von Mises was unable to resolve, but these can be fixed rigorously using modern computational complexity theory. But that doesn't blunt the force of Burdzy's main criticism.)
His criticism of de Finetti is more thorough. There is the usual criticism, namely, that subjective probability is unscientific as it is not falsifiable. Moreover, there is no guidance on how to actually set probabilities. Nor is there anything in de Finetti to suggest that probabilities should be based on informed prior opinion, as many Bayesians would argue. More surprising is Burdzy's claim that subjective probability has the same problem as von Mises' frequency theory: it does not provide a probability for an individual event. This claim will raise the hackles of die-hard Bayesians. But he is right: de Finetti's coherence argument requires that you bet on several events. The rules of probability arise from the demand that you avoid a sure losing bet (a Dutch book) on the collection of bets. The argument does not work if we supply a probability only on a single event. The criticisms of de Finetti's subjectivism go beyond this and I will not attempt to summarize them.
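The coherence argument can be made concrete with a little arithmetic. Here is a minimal sketch (my own illustration, not from the book or the review; the prices and stakes are made up) of why the Dutch book needs a collection of bets: an agent who buys $1-payoff tickets on A and on not-A at prices summing to more than 1 loses money in every outcome, while coherent prices leave no such opening.

```python
def dutch_book_payoffs(p_a, p_not_a, stake=1.0):
    """Agent buys a $1-payoff ticket on A at price p_a and a ticket on
    not-A at price p_not_a. Return the agent's net gain in each of the
    two possible outcomes (A occurs, A does not occur)."""
    cost = (p_a + p_not_a) * stake
    gain_if_a = stake - cost       # only the ticket on A pays off
    gain_if_not_a = stake - cost   # only the ticket on not-A pays off
    return gain_if_a, gain_if_not_a

# Incoherent prices: P(A) + P(not A) = 1.2 > 1, so the agent loses
# about 0.2 whichever outcome occurs -- a Dutch book.
print(dutch_book_payoffs(0.6, 0.6))

# Coherent prices: P(A) + P(not A) = 1, and no sure loss is possible.
print(dutch_book_payoffs(0.3, 0.7))
```

Note that the sure loss only appears once the pair of bets is considered together, which is exactly the point about single events made above.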
Burdzy provides his own foundation for probability. His idea is that probability should be a science, not a philosophy, and that, as such, it should be falsifiable. Allow me to make an analogy. Open any elementary book on quantum mechanics and you will find a set of axioms. These axioms can be used to make very specific predictions. If the predictions are wrong (and they never have been), then the axioms would be rejected. But to use the axioms, one must inject some specifics. In particular, one must supply the Hamiltonian for the problem. If the resulting predictions fail to agree with reality, we can reject that Hamiltonian.
To make probability scientific, Burdzy proposes laws that lead to certain predictions that are vulnerable to falsification. More importantly, the specific probability assignments we make are open to being falsified. Before stating his laws, let me emphasize a crucial aspect of Burdzy's approach. Probability, he claims, is the search for certainty; hence the title of the book. That might seem counter to how we think of probability but I think his idea is correct. In frequentist theory, we make deterministic predictions about limits of sequences. In subjectivist theory, we make the deterministic claim that if we assign probabilities consistent with the rules of probability then we are certain to be immune to a Dutch book. A philosophy of probability, according to Burdzy, is the search for what claims we can make for certain.
Burdzy's proposal is to have laws -- not axioms -- of probability. Axioms, he points out, merely encode facts we regard as uncontroversial. Laws, instead, are proposals for a scientific theory that are open to falsification. Here are his five proposed laws (paraphrased):
(L1) Probabilities are numbers between 0 and 1.
(L2) If A and B are disjoint then P(A or B) = P(A) + P(B).
(L3) If A and B are physically independent then they are mathematically independent, meaning that P(A and B) = P(A)P(B).
(L4) If there exists a symmetry on the space of possible outcomes which maps an event A onto an event B then P(A)=P(B).
(L5) P(A)=0 if and only if A cannot occur. P(A)=1 if and only if it must occur.
Some comments are in order. (L1) and (L2) are standard of course. (L4) refers to ideas like independent and identically distributed sequences, or exchangeability. It is not an appeal to the principle of indifference. Quite the opposite. Burdzy argues that introducing symmetry requires information, not lack of information.
(L3) and (L4) are taught in every probability course as add-ons. But in fact they are central to how we actually construct probabilities in practice. The author asks: Why treat them as follow-up ideas? They are so central to how we use probability that we should elevate them to the status of fundamental laws.
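To see how much constructive work (L3) and (L4) do, here is a small sketch (my own illustration, assuming fair coin tosses; it is not an example from the book): symmetry pins down the probability of a single toss, and physical independence extends it to sequences, from which binomial probabilities follow.

```python
from fractions import Fraction
from math import comb

# (L4): the symmetry swapping heads and tails forces P(H) = P(T) = 1/2.
p_heads = Fraction(1, 2)

# (L3): tosses are physically independent, so probabilities multiply;
# summing over the comb(n, k) equally likely orderings with k heads
# gives the binomial probability.
def prob_k_heads(n, k, p=p_heads):
    """Probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(prob_k_heads(10, 5))  # 63/256
```

Nothing here appeals to ignorance: the symmetry and independence are physical claims about the coin and the tosses, which is Burdzy's point.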
(L5) is what makes the theory testable. Here is how it works. Based on our probability assignments, we can construct events A that have probability very close to 0 or 1. For example, A could be the event that the proportion of heads in many tosses is within .00001 of 1/2. If this doesn't happen, then we have falsified the probability assignment. Of course P(A) will rarely be exactly 0 or 1; rather, it will be close to 0 or 1. But this is precisely what happens in all sciences. We can test predictions of general relativity or quantum mechanics to a level of essential certainty, but never exact certainty. Thus Burdzy's approach puts probability on the same level as other scientific theories.
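As a rough illustration of the kind of near-certain event this test trades in (my own sketch, using the normal approximation; the book does not present this calculation): for the coin-tossing example, the probability that the proportion of heads lands within .00001 of 1/2 depends sharply on the number of tosses.

```python
import math

def prob_within(n, eps=1e-5, p=0.5):
    """Normal approximation to P(|p_hat - p| < eps) for the proportion
    of heads p_hat in n independent tosses of a coin with P(heads)=p."""
    sd = math.sqrt(p * (1 - p) / n)           # sd of p_hat
    return math.erf(eps / sd / math.sqrt(2))  # = 2*Phi(eps/sd) - 1

# With enough tosses the event has probability essentially 1, so
# observing a proportion outside the band falsifies the assignment;
# with too few tosses the event is nowhere near certain and is not
# a useful test.
print(prob_within(10**12))  # essentially 1
print(prob_within(10**6))   # far below 1
```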
To summarize, Burdzy's approach is to treat probability as a scientific theory. It has rules for making probability assignments and the resulting probabilities can be falsified. Not only is this simple, it is devoid of the murkiness of subjectivism and the weakness of von Mises' frequentism. And, perhaps most importantly, it reflects how we use probability. It also happens to be easy to teach. My only criticism is that I think the implications of (L1)-(L5) could be fleshed out in more detail. It seems to me that they work well for providing a foundation for testable frequency probability. That is, it provides a convincing link between probability and frequency. But that could reflect my own bias towards frequency probability. More detail would have been nice.
My short summary of this book does not do justice to the author's arguments. In particular, there is much more to his critique of subjective probability than I have presented in this review. The best thing about this book is that it will offend and annoy both frequentists and subjectivists. I implore my friends on both sides of the philosophical divide to read the book with an open mind.
My reply:
1. Whatever von Mises's merits (or lack thereof) in general, I can't take him seriously as a philosopher of statistical practice (see pages 3-4 of this article).
2. As I wrote earlier, Burdzy's comments about subjectivism may or may not be accurate, but they have nothing to do with the Bayesian data analysis that I do. In that sense, I don't think that Larry's comment about "both sides of the philosophical divide" is particularly helpful. I see no reason to choose between two discredited philosophies, and in fact in chapter 1 of BDA we are very clear about the position we take, which indeed is completely consistent with Popper's ideas of refutation and falsifiability.
As I wrote before, "My guess is that Burdzy would differ very little from Christian Robert or myself when it comes to statistical practice. . . . but I suppose that different styles of presentation will be effective with different audiences." Larry's review suggests that there are such audiences out there.



Numbers Rule Your World

"Numbers Rule Your World: The hidden influence of probability and statistics on everything you do." It is published by McGraw-Hill.


Economics and voter irrationality: my review of The Myth of the Rational Voter

By Andrew Gelman on December 27, 2009 4:02 PM

I recently reviewed Bryan Caplan's book, The Myth of the Rational Voter, for the journal Political Psychology. I wish I thought this book was all wrong, because then I could've titled my review, "The Myth of the Myth of the Rational Voter." But, no, I saw a lot of truth in Caplan's arguments. Here's what I wrote:
Bryan Caplan's The Myth of the Rational Voter was originally titled "The logic of collective belief: the political economy of voter irrationality," and its basic argument goes as follows:
(1) It is rational for people to vote and to make their preferences based on their views of what is best for the country as a whole, not necessarily what they think will be best for themselves individually.
(2) The feedback between voting, policy, and economic outcomes is weak enough that there is no reason to suppose that voters will be motivated to have "correct" views on the economy (in the sense of agreeing with the economics profession).
(3) As a result, democracy can lead to suboptimal outcomes--foolish policies resulting from foolish preferences of voters.
(4) In comparison, people have more motivation to be rational in their economic decisions (when acting as consumers, producers, employers, etc.). Thus it would be better to reduce the role of democracy and increase the role of the market in economic decision-making.
Caplan says a lot of things that make sense and puts them together in an interesting way. Poorly informed voters are a big problem in democracy, and Caplan makes the compelling argument that this is not necessarily a problem that can be easily fixed--it may be fundamental to the system. His argument differs from that of Samuel Huntington and others who claimed in the 1970s that democracy was failing because there was too much political participation. As I recall, the "too much democracy" theorists of the 1970s saw a problem with expectations: basically, there is just no way for "City Hall" to be accountable to everyone, thus they preferred limiting things to a more manageable population of elites. Caplan thinks that voting itself (not just more elaborate demands for governmental attention) is the problem.
Bounding the arguments
I have a bunch of specific comments on the book but first want to bound its arguments a bit.
First, Caplan focuses on economics, and specifically on economic issues that economists agree on. To the extent the economists disagree, the recommendations are less clear. For example, some economists prefer a strongly graduated income tax, others prefer a flat tax. Caplan would argue, I think, that tax rates in general should be lowered (since that would reduce the role of democratic government in the economic sphere) but it would still be up to Congress to decide the relative rates. This isn't a weakness of Caplan's argument; I'm just pointing out a limitation of its applicability. For another example, Caplan asks, "Why are inefficient policies like the minimum wage popular?" Isn't this a question of values? My impression is that some economists support a higher minimum wage, some don't.
More generally, non-economic issues--on which there is no general agreement by experts--spread into the economic sphere. Consider policies regarding national security, racial discrimination, and health care. Once again, I'm not saying that Caplan is wrong in his analysis of economic issues, just that democratic governments do a lot of other things. (At one place he points out that the evidence shows that voters typically decide whom to vote for based on economic considerations; see, for example, Hibbs (2008). But, even though the economy might be decisive on the margin, that doesn't mean these other issues don't matter.)
Another example is Caplan's discussion of toxicology, an area that I happen to have worked in. One of the difficulties is that people underestimate some risks and overestimate others. Thus, simple advice such as "worry" or "don't worry" isn't so helpful, especially since there is typically a lag between exposure and health problems. In his discussion, Caplan ignores some of the political factors. On one side, industry has a lot of motivation to downplay the risks, and they do lots of lobbying in Congress. On the other, agencies such as the EPA sometimes are motivated to overstate risks. So these views don't occur in a vacuum.
Finally, Caplan generally considers democracy as if it were direct. But I think representative democracy is much different from direct democracy. Caplan makes some mention of this, the idea that politicians have some "slack" in decision-making, but I suspect he is understating the importance of the role of the politicians in the decision-making process.
Specific comments
Later on, he writes, "What is the full price of ideological loyalty? It is the material wealth you forego in order to believe." I think that's part of it but not all. For example, suppose I have a false belief that the economic policy of party A will be good for the country. I (and others like me) vote for A, the party wins the election, implements the policy, and things get worse (compared to what would have happened had party B won). I will be a little unhappy to hear about the problems in the national economy. To the extent I care about others (and, as Caplan notes, that's why I'm voting in the first place, also probably a big motivation of why he wrote his book), if I have loyalty to a bad ideology, I'll pay the price in terms of a negative national outcome, even if I'm not personally affected.
Regarding the views of economists and others, I was surprised to see Caplan write, "Out of all the complaints that economists lodge against laymen, four families of beliefs stand out . . . antimarket bias, anti-foreign bias, make-work bias, and pessimistic bias." I'm surprised to hear this, because I thought that the two concepts that economists thought were most important (and ignored by noneconomists) were (a) opportunity cost, and (b) externalities. These two concepts arise in most of Caplan's examples so maybe it's just a labeling issue, I don't really know. It's also funny that Caplan mentions "pessimistic bias," since his book is itself so pessimistic!
On a similar point, he has a quote "ridiculing the 'abundance denial' of the developed world." I don't know what he's talking about! People in the U.S. have more cars, T.V.'s, etc, etc, than ever before! This doesn't look like "abundance denial" to me! Yes, there are poor people in the U.S., but on the average people consume a lot of material goods. Perhaps the problem here is that economist Caplan is judging psychological issues, whereas I (a political scientist) am trying to make an economic judgment.
In discussing the political consequences of his ideas, Caplan writes, "asymmetric information leads to less government." I see what he's saying, and this is a key part of his argument, but I don't know that this is often possible. For example, consider crime control. Ethnic-minority voters often don't trust the police, but having less police isn't considered a desirable outcome either. Similarly, if I don't think the government is doing enough to protect us from terrorism, I probably won't say the solution is to have a less active government. (Wanting less government protection from terrorism might be a legitimate view to hold, but it doesn't seem to me to be the natural view.)
To return to issues of psychology, Caplan correctly points out that preferences are unobservable. I'd go further and say that latent preferences (and "utility functions") don't even exist. We construct our preferences as need be to solve particular problems (see Lichtenstein and Slovic, 2006). Caplan expresses surprise about "the political influence of great poets like Pablo Neruda"--why should people trust a poet's view on political issues? I think he's missing the point, which is that a poet can take a view that one might already have, but express it very well. More generally, celebrities symbolize a lot of things. I don't know why seeing Michael Jordan in an ad would make someone more likely to go to McDonald's, but they pay him a lot of money to create these associations.
One of the interesting things about this book is seeing an economist's perspective on issues of political psychology. Conversely, in discussing the views that political scientists and others have of economics, Caplan writes, "it is usually economists themselves who discover the exceptions [to 'market fundamentalism'] in the first place." Maybe it would be more accurate to write that some of these ideas are taken more seriously by economists, hence they take the trouble to note the exceptions. Scientists in other fields would often never even entertain "market fundamentalism" in the first place so they don't bother to refute it. For example, when I told my psychology-professor friend about my ideas on rational voting, he wasn't particularly interested because psychologists know all about how people are irrational. They don't see rationality as expected. I see rational-choice arguments as complementing rather than competing with psychological explanations of political behavior. Others have examined different ways in which such models are useful. For example, Caplan writes, "A worker could always offer to work for a reduced salary in exchange for more on-the-job safety," but Dorman (1996) argues convincingly that this does not actually happen.
It is too much to expect any player in the political system to be entirely rational--Ansolabehere and Snyder (2003) argue that even lobbyists are not particularly rational in their campaign contributions. Despite what is sometimes said, voting is not particularly irrational as compared to other social and political activities. Voting has low cost and a very small chance of making a difference, but in that unlikely event, the difference can have huge repercussions nationally and globally; hence, the expected return from voting is arguably on the same order of magnitude as its cost (see Parfit, 1984, and Edlin, Gelman, and Kaplan, 2007).
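The order-of-magnitude claim in the Edlin, Gelman, and Kaplan argument can be sketched numerically (the constants below are illustrative assumptions of mine, not figures from the paper): in a close election the probability of casting the decisive vote shrinks roughly like 1/n, while the total social stakes grow like n, so the two factors cancel.

```python
def expected_social_benefit(n_voters, decisiveness_scale=10.0,
                            benefit_per_person=100.0):
    """P(decisive) ~ scale/n in a close election; the total benefit of
    the better candidate winning ~ n * benefit_per_person. Both numbers
    are made-up placeholders; only their scaling in n matters here."""
    p_decisive = decisiveness_scale / n_voters
    total_benefit = benefit_per_person * n_voters
    return p_decisive * total_benefit

# The 1/n and n factors cancel, so the expected social benefit is of
# the same order whether the electorate is a town or a nation.
print(expected_social_benefit(10**4))
print(expected_social_benefit(10**8))
```

This is why voting for the benefit of others can be rational even though the chance of being decisive is tiny.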
Much of the work on the rationality of voting focuses on the decision of whether to vote, and which candidates to vote for. Caplan usefully switches the focus to policy, and he does a good job at exploring the implications of the fact that people don't have an economic motivation for being good voters. Even when they are voting rationally (by their own views), the feedback mechanism for updating their preferences is pretty weak.
I'm not so convinced by Caplan's arguments in favor of the alternatives of rule by business or rule by educated elites. I think his main argument (theoretical and practical problems with democracy) can be separated from some of his more debatable stances.
Ansolabehere, S., and Snyder, J. (2003). Why is there so little money in U.S. politics? Journal of Economic Perspectives 17, 105-130.
Dorman, P. (1996). Markets and Mortality: Economics, Dangerous Work, and the Value of Human Life. Cambridge University Press.
Edlin, A., Gelman, A., and Kaplan, N. (2007). Voting as a rational choice: why and how people vote to improve the well-being of others. Rationality and Society 19, 293-314.
Hibbs, D. A. (2008). The implications of the "bread and peace" model for the 2008 US presidential election outcome. Public Choice 137, 1-10.
Huntington, S. P. (1975). The United States. In The Crisis of Democracy, ed. M. Crozier, S. Huntington, and J. Watanuki. New York University Press.
Lichtenstein, S., and Slovic, P., eds. (2006). The Construction of Preference. Cambridge University Press.
Parfit, D. (1984). Reasons and Persons. Oxford University Press.

Hopefully Anonymous December 28, 2009 12:58 PM Reply
"I'm not so convinced by Caplan's arguments in favor of the alternatives of rule by business or rule by educated elites."
This posture is more cowardly than you've been lately in my opinion.
That's a very limited universe of alternatives--I've seen a lot more and better in various comment threads in the academic blogosphere.
Andrew Gelman December 28, 2009 3:17 PM Reply
I'm not an expert on political theory and was not trying in my review to consider all the alternatives; my goal was to assess the arguments in Caplan's book. I'm sure that many in and out of the blogosphere can add a lot to the discussion beyond what I have to say. Caplan's book was based on public opinion research, which might be why I was asked to review it.
William Ockham December 28, 2009 4:16 PM Reply
Perhaps there's room for a book called "The Myth of the Rational Economist". Caplan overlooks many possible answers to the conundrum at the center of his book. He says that the general public and economists disagree about basic economic facts, therefore one of the two groups must be wrong. In reality, there are many other possible explanations. Both groups are made up of human beings and therefore both groups might suffer from systematic bias. As just one example, Caplan states that receiving a free washing machine is just the same as getting downsized because in both cases society conserves valuable labor. This is just about the stupidest, most biased thing I have read this week. To fail to see how involuntary unemployment can corrode the social fabric is a blindness far worse than anything he accuses the public of.
Robin Hanson replied to comment from Andrew Gelman December 28, 2009 4:40 PM Reply
So you agree with pretty much everything Bryan says, except on topics outside what you consider your area of expertise, where you don't want to offer reasons for your disagreement? Seems you agree just about as much as Bryan could possibly hope for. :)
Andrew Gelman December 28, 2009 4:57 PM Reply
Yes, I think this was one thing I was getting at in my review, that Caplan at times jumps from statements about public opinion to his personal political attitudes.
I hope that Bryan is happy with the review; I think he did an excellent job in his book! But I do think it's useful to point out areas where he is extrapolating beyond his evidence. If you read my review carefully, you'll see that I do offer reasons for my disagreements with several specific points in his book.
When I write that "his main argument (theoretical and practical problems with democracy) can be separated from some of his more debatable stances," this is intended to be a positive statement. What I'm saying is that, even if you disagree with some of Caplan's strongly-held political views, this shouldn't cause you to dismiss his main argument. (To put it in logical notation, if Caplan's main argument is A and his political views are B, then he does not demonstrate that A -> B. Thus, disbelief in B is not a reason to disbelieve A.)


My Five Rules for Data Visualization

My Five Rules for Data Visualization
By Drew Conway, on December 3rd, 2009
Tonight the NYC R Meetup will be discussing data visualization in R using ggplot2. As part of tonight's meeting I will be providing a very brief show and tell, which includes mostly code examples and external resources. This exercise has had me thinking quite a bit about data visualization. In addition, a few days ago the Security Crank (great new blog) pinged me on the apparent uselessness of network analysis visualizations in the defense and intelligence communities. As I say in my comment at SC, I agree, but only in that the method is abused by those who view it only as a means to generate "pretty pictures." All of this has touched off a very important point about data analysis, possibly the most important one: how best to convey an analysis visually.
Consumers of data analytics are very rarely analysts themselves, so those in the business of generating plots, figures, charts, graphs, etc. must not only be expert in the analytical process, but also in choosing the best format and medium for relaying that knowledge to an audience. Admittedly, I am not Edward Tufte, Ben Fry, or David McCandless, but I have been around long enough to know what does and does not work, and as such here (in no particular order) are my five rules for data visualization.
1. The viz must be able to stand alone

This I learned early, after being dressed down multiple times while giving briefings to senior intelligence officers. Since then it has been reinforced while sitting in on failed job talks and conference presentations. The important thing to keep in mind is that when an audience sees a visualization it should be providing answers, not generating more questions.
This, to me, is the most difficult aspect of creating high quality data visualizations. As the creators we are often intimately familiar with the data, and thus take its subtleties for granted. Some people recommend asking yourself “would my Grandmother understand this,” but why insult Grandma’s intelligence? Here’s the bottom line: you have to decide the most efficient means of plotting the data (we’ll get to this), then you have a chart title, legend, possibly some axis labels, and if you are bold a short (140 characters is a good limit) footnote to get your point across. The best visualizations only require a subset of these to be effective, but once you have added the appropriate data accoutrements the chart better be self-explanatory. Very simple and imperfect example: restaurant tipping trends between men and women.

Why is the chart on the right better? First, it has more explanatory value. By splitting the data into two parts we are able to see the x-axis shift for men, i.e., in general they are tipping on higher bills. Also, we are able to use color in a more valuable way; rather than using it to distinguish between sex we can use it to highlight outliers and note general trends. Next, by reducing the amount of data in each plot the information is conveyed more efficiently. Finally, it achieves our ultimate goal, which is always to provide more answers than questions.
2. Have a diverse tool set

Learning the quirks and syntax of various data visualization tools is time consuming and often frustrating, but if you want to create impressive charts you have to do it. I am very sorry to report that Microsoft Excel + PowerPoint do not generate the best data visualizations. In fact, they often generate visualizations in the 10-20th percentile of quality. The question, therefore, is: how do you find the best tools for your task?
Most of us will not have the resources to use professional data visualizations suites, but even so these tools are often limited by the scope and vision of their creators. Explore the open-source and general purpose data visualization options out there, learn the three best that fit your needs, and always be open to learning the new stuff—it will pay off.
3. People are terrible at distinguishing small differences

This could also be described as the “pie chart trap,” but clearly goes beyond that particular chart design. In fact, network visualizations are notorious for blurring subtle differences. For example, visualizations of massive amounts of social network data can be beautiful, but in nearly all cases they are much more art than science. If we are interested in telling a story with our data, and our data is large and complex, then we need to be creative about how to parse that complexity in order to enhance the clarity of our story. Example using networks: the structure of venture capital co-investments

The visualizations above examine the same data, and even use a similar technique to visualize it, but clearly the example on the right is conveying a more informative story. Admittedly, this visualization, which I generated, in many ways violates my first rule; however, it is still telling a story (e.g., there is a strong underlying structure among four notable communities of VC firms). The visualization on the left, taken from an initial attempt at analyzing this data, tells almost no story; save that the network is highly complex and there exist some disconnected firms.
4. Color selection matters

This would seem to be a self-evident point, but it may be the most often violated rule of quality visualization. It seems the primary reason for this problem is laziness, as the default color schemes in many visualization packages were not designed to convey information (again, see the left panel of the figure above). I recently violated this rule while putting together the slides for tonight’s R meetup. Using a single line of R code I generated this chart:

ggplot(subset(whiskey_brands, Brand != "Other brands"),
       aes(x = Type, fill = Brand)) +
  geom_bar(position = "fill")

In my defense, I was first excited that there was a built-in Scotch whiskey dataset in R, but I also wanted to show what could be done with a single line of code. Clearly, however, the color scheme I used is taking away from the story. The default color scheme in ggplot2 wants to use a gradient, which may be useful in some cases, but not here. To improve the above example I should override this default and construct a more informative color scheme, such as setting a base color for each Scotch type (e.g., blue for blends and green for single malts).
5. Reduce, reuse, recycle

When developing statistical models we are often striving to specify the most "parsimonious" model, that is, the model that has the highest explanatory value-to-required variables ratio. We do this to reduce waste in our models, enhance our degrees of freedom, and provide a model that is most relevant to the data. The exact same rules apply to visualizations. Not all observations are created equal; therefore, they may not all belong in a visualization. Those who are analyzing large datasets take data reduction (or "munging") as given, but in any visualization, if something is not adding value, take it out. Developing new and meaningful methods for reducing data is a serious challenge, but one that should be considered before any attempt at visualization is made.
On the other hand, if a reduction and/or visualization method has been successful in the past then it will likely be successful in the future, so do not be afraid to reuse and recycle. Many of the most successful data visualizers have distinguished themselves by creating a method for visualization and sticking with it (think Gapminder). Not only will it possibly make you famous, but putting in the effort to create a useful method for combining, reducing and visualizing data will mean your efforts are more streamlined in the long term.
So that’s it. Nothing too profound there, but I wanted to post this in order to start a conversation. In that vein, what did I miss and where do you disagree? As always, I welcome your comments.

Must-Have R Packages for Social Scientists

After recently having to think critically about the value of various R packages for social science research, I realized that others might find value in a post on "must-have" R packages for social scientists. After the immensely popular post on this topic for Python packages, a follow-up seemed appropriate. If you conduct social science research but are desperately clinging onto your SAS, SPSS or Matlab licenses, waiting for someone to convince you of R's value, please allow me to be the first to try.
R is a functional programming language that allows for seamless data exploration, manipulation, analysis and visualization. The community using and supporting the language has exploded over the last several years, which has led to the development of several immensely useful packages, many of which have direct application in the social sciences. Below are the R packages I use on a weekly/daily/monthly basis (in no particular order) and highly recommend to any R users, new or old.
Put simply, Zelig is a one-stop statistical shop for nearly all regression model specifications. With a uniform syntax across model types, and several extremely useful plotting functions, the package’s author Gary King (Political Science and Statistics at Harvard University) calls Zelig “everyone’s statistical software,” which is a very accurate description. If there is one R package that every social scientist should have, it is Zelig!
Download Zelig
One of the advantages of R as a functional language is that it contains a set of convenient base functions for plotting data. While useful when exploring a dataset, they are–for lack of a better word–ugly, and this is where ggplot2 comes in. Using the Grammar of Graphics manifesto as a guide, creator Hadley Wickham designed ggplot2 to “take the good parts of base and lattice graphics and none of the bad parts,” and he succeeded. This is the premier R package for conveying your analysis visually.
Download ggplot2
I have combined the two competing network analysis packages in R into a single bullet because each has its strengths and weaknesses, and as such there is value in learning and using both. The igraph package approaches network analysis from the mathematics/physics/graph-theoretic perspective, including several advanced metrics and random graph models. In contrast, Statnet was primarily designed for social science, and its primary advantage is the inclusion of a series of functions for estimating and testing ERGM/p* graph models.
Download igraph | Download Statnet
Also brought to you by R guru Hadley Wickham, the plyr package assists researchers in the least glamorous aspect of their work: data manipulation and cleaning. One of R’s great advantages is its ability to handle very large datasets, and plyr is there to help you break these large data problems into smaller, more manageable pieces.
Download plyr
Amelia II
Also developed by Gary King, Amelia II contains a set of algorithms for multiple imputation of missing data across a wide range of data types, such as survey, time-series and cross-sectional. As missing-data problems are ubiquitous in social science research, the functions contained in this package provide a powerful solution to these issues.
Download Amelia II
This package is used to fit and compare Gaussian linear and nonlinear mixed-effects models. For those examining complex time series data with various correlation structures, the nlme package provides a number of options for fits, tests and plotting.
Download nlme
Unlike newer versions of Python, the current build of R does not contain native functionality for distributing jobs across high-performance computing clusters. The SNOW and Rmpi packages provide this functionality, and are highly recommended to any researcher with access to an HPC environment running R.
Download SNOW | Download Rmpi
Both of these packages convert R summary results into LaTeX/HTML table format. The xtable package is a general solution, while the apsrtable package, developed by fellow political science grad student Michael Malecki, will output tables in APSR format, for those of you fortunate enough to need that format.
Download xtable | Download apsrtable
Got panel data? If so, you need plm, which contains all of the necessary model specifications and tests for fitting a panel-data model, including specifications for instrumental-variable models.
Download plm
As I stated, R is great for dealing with large datasets; however, occasionally you will encounter a dataset so large that it grinds R’s base I/O functions to a halt. As the name suggests, the sqldf package overcomes this by allowing users to perform SQL statements directly on R data frames, greatly increasing efficiency.
Download sqldf
I hope that you will explore and use the packages above that you are not already familiar with. To those who have never used R and/or have an irrational phobia of the language, let this list provide the appropriate motivation. Also, to the R experts out there, I welcome any suggestions for more R packages useful to the social science inclined!


Fake-data simulation as a research tool (Andrew Gelman)

I received the following email:
I was hoping if you could take a moment to counsel me on a problem that I'm having trying to calculate correct confidence intervals (I'm actually using a bootstrap method to simulate 95%CIs). . . . [What follows is a one-page description of where the data came from and the method that was used.]
My reply:
Without following all the details, let me make a quick suggestion which is that you try simulating your entire procedure on a fake dataset in which you know the "true" answer. You can then run your procedure and see if it works there. This won't prove anything but it will be a way of catching big problems, and it should also be helpful as a convincer to others.
If you want to carry this idea further, try to "break" your method by coming up with fake data that causes your procedure to give bad answers. This sort of simulation-and-exploration can be the first step in a deeper understanding of your method.
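The advice above can be sketched in a few lines of base R. This is a minimal, hypothetical stand-in for the reader's actual procedure: a simple linear regression fit to fake data simulated from known parameters, so we can check whether the procedure recovers the truth.

```r
# Fake-data simulation: generate data from known parameters, then
# run the estimation procedure and compare its answer to the truth.
set.seed(42)
n <- 100
true_intercept <- 1.5
true_slope <- 2.0
x <- runif(n)
y <- true_intercept + true_slope * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)                 # stand-in for "your procedure"
estimate <- unname(coef(fit)["x"])

cat("true slope:", true_slope, " estimated slope:", round(estimate, 2), "\n")

# To try to "break" the method, rerun with fake data that violates an
# assumption, e.g., heavy-tailed errors: replace rnorm() with rt(n, df = 2).
```

If the estimate lands far from the known truth on data this clean, something in the pipeline is broken; that is the "catching big problems" step.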
And then I got another, unrelated email from somebody else:
I am working on a mixed treatment comparison of treatments for non-small cell lung cancer. I am doing the analysis in two parts in order to estimate treatment effects (i.e. log hazard ratios) and absolute effects (by projecting the log hazard ratios onto a baseline treatment scale parameter; the baseline treatment times to event are assumed to arise from a Weibull distribution). . . . [What follows is a one-page description of the model, which was somewhat complicated by constraints on some of the variance parameters] . . . I can get my analysis to run with constraints imposed on the treatment specific prior distributions for PFS and OS, and on the population log hazard ratios for PFS and OS. However, my problem is that the constraint does not appear to be doing anything and the results are similar to what I obtain without imposing the constraint. This is not what I expect . . .
My reply:
Sometimes the data are strong enough that essentially no information is supplied by external constraints. You can, to some extent, check how important this is for your problem by simulating some fake data from a setting similar to yours and then seeing whether your method comes close to reproducing the known truth. You can look at point estimates and also the coverage of posterior intervals.
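The interval-coverage check suggested above can be sketched with base R, using a simple linear regression as a hypothetical stand-in for the correspondent's model: simulate many datasets from known parameters, fit the model each time, and count how often the 95% interval covers the truth.

```r
# Coverage check via repeated fake-data simulation.
set.seed(1)
true_slope <- 2.0
n_sims <- 200
covered <- replicate(n_sims, {
  x <- runif(50)
  y <- 1 + true_slope * x + rnorm(50)
  ci <- confint(lm(y ~ x))["x", ]   # 95% confidence interval for the slope
  ci[1] <= true_slope && true_slope <= ci[2]
})
coverage <- mean(covered)
cat("empirical coverage of nominal 95% intervals:", coverage, "\n")
```

If the empirical coverage is far from the nominal 95%, the procedure (or, in the correspondent's case, the handling of the constraints) deserves a closer look.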