Monday, December 28, 2009

Economics and voter irrationality: my review of The Myth of the Rational Voter

By Andrew Gelman on December 27, 2009 4:02 PM

I recently reviewed Bryan Caplan's book, The Myth of the Rational Voter, for the journal Political Psychology. I wish I thought this book was all wrong, because then I could've titled my review, "The Myth of the Myth of the Rational Voter." But, no, I saw a lot of truth in Caplan's arguments. Here's what I wrote:
Bryan Caplan's The Myth of the Rational Voter was originally titled "The logic of collective belief: the political economy of voter irrationality," and its basic argument goes as follows:
(1) It is rational for people to vote and to make their preferences based on their views of what is best for the country as a whole, not necessarily what they think will be best for themselves individually.
(2) The feedback between voting, policy, and economic outcomes is weak enough that there is no reason to suppose that voters will be motivated to have "correct" views on the economy (in the sense of agreeing with the economics profession).
(3) As a result, democracy can lead to suboptimal outcomes--foolish policies resulting from foolish preferences of voters.
(4) In comparison, people have more motivation to be rational in their economic decisions (when acting as consumers, producers, employers, etc.). Thus it would be better to reduce the role of democracy and increase the role of the market in economic decision-making.
Caplan says a lot of things that make sense and puts them together in an interesting way. Poorly informed voters are a big problem in democracy, and Caplan makes the compelling argument that this is not necessarily a problem that can be easily fixed--it may be fundamental to the system. His argument differs from that of Samuel Huntington and others who claimed in the 1970s that democracy was failing because there was too much political participation. As I recall, the "too much democracy" theorists of the 1970s saw a problem with expectations: basically, there is just no way for "City Hall" to be accountable to everyone, so they preferred limiting things to a more manageable population of elites. Caplan thinks that voting itself (not just more elaborate demands for governmental attention) is the problem.
Bounding the arguments
I have a bunch of specific comments on the book but first want to bound its arguments a bit.
First, Caplan focuses on economics, and specifically on economic issues that economists agree on. To the extent that economists disagree, the recommendations are less clear. For example, some economists prefer a strongly graduated income tax, others prefer a flat tax. Caplan would argue, I think, that tax rates in general should be lowered (since that would reduce the role of democratic government in the economic sphere), but it would still be up to Congress to decide the relative rates. This isn't a weakness of Caplan's argument; I'm just pointing out a limitation of its applicability. For another example, Caplan asks, "Why are inefficient policies like the minimum wage popular?" Isn't this a question of values? My impression is that some economists support a higher minimum wage, some don't.
More generally, non-economic issues--on which there is no general agreement by experts--spread into the economic sphere. Consider policies regarding national security, racial discrimination, and health care. Once again, I'm not saying that Caplan is wrong in his analysis of economic issues, just that democratic governments do a lot of other things. (At one place he points out that the evidence shows that voters typically decide whom to vote for based on economic considerations; see, for example, Hibbs (2008). But, even though the economy might be decisive on the margin, that doesn't mean these other issues don't matter.)
Another example is Caplan's discussion of toxicology, an area that I happen to have worked in. One of the difficulties is that people underestimate some risks and overestimate others. Thus, simple advice such as "worry" or "don't worry" isn't so helpful, especially since there is typically a lag between exposure and health problems. In his discussion, Caplan ignores some of the political factors. On one side, industry has a lot of motivation to downplay the risks, and it does lots of lobbying in Congress. On the other hand, agencies such as the EPA are sometimes motivated to overstate risks. So these views don't occur in a vacuum.
Finally, Caplan generally considers democracy as if it were direct. But I think representative democracy is much different from direct democracy. Caplan makes some mention of this, the idea that politicians have some "slack" in decision-making, but I suspect he is understating the importance of the role of the politicians in the decision-making process.
Specific comments
Later on, he writes, "What is the full price of ideological loyalty? It is the material wealth you forego in order to believe." I think that's part of it but not all. For example, suppose I have a false belief that the economic policy of party A will be good for the country. I (and others like me) vote for A, the party wins the election, implements the policy, and things get worse (compared to what would have happened had party B won). I will be a little unhappy to hear about the problems in the national economy. To the extent I care about others (and, as Caplan notes, that's why I'm voting in the first place, also probably a big motivation of why he wrote his book), if I have loyalty to a bad ideology, I'll pay the price in terms of a negative national outcome, even if I'm not personally affected.
Regarding the views of economists and others, I was surprised to see Caplan write, "Out of all the complaints that economists lodge against laymen, four families of beliefs stand out . . . antimarket bias, anti-foreign bias, make-work bias, and pessimistic bias." I'm surprised to hear this, because I thought that the two concepts that economists thought were most important (and ignored by noneconomists) were (a) opportunity cost, and (b) externalities. These two concepts arise in most of Caplan's examples so maybe it's just a labeling issue, I don't really know. It's also funny that Caplan mentions "pessimistic bias," since his book is itself so pessimistic!
On a similar point, he has a quote "ridiculing the 'abundance denial' of the developed world." I don't know what he's talking about! People in the U.S. have more cars, TVs, etc., than ever before! This doesn't look like "abundance denial" to me! Yes, there are poor people in the U.S., but on average people consume a lot of material goods. Perhaps the problem here is that economist Caplan is judging psychological issues, whereas I (a political scientist) am trying to make an economic judgment.
In discussing the political consequences of his ideas, Caplan writes, "asymmetric information leads to less government." I see what he's saying, and this is a key part of his argument, but I don't know that this is often possible. For example, consider crime control. Ethnic-minority voters often don't trust the police, but having less police isn't considered a desirable outcome either. Similarly, if I don't think the government is doing enough to protect us from terrorism, I probably won't say the solution is to have a less active government. (Wanting less government protection from terrorism might be a legitimate view to hold, but it doesn't seem to me to be the natural view.)
To return to issues of psychology, Caplan correctly points out that preferences are unobservable. I'd go further and say that latent preferences (and "utility functions") don't even exist. We construct our preferences as need be to solve particular problems (see Lichtenstein and Slovic, 2006). Caplan expresses surprise about "the political influence of great poets like Pablo Neruda"--why should people trust a poet's view on political issues? I think he's missing the point, which is that a poet can take a view that one might already have, but express it very well. More generally, celebrities symbolize a lot of things. I don't know why seeing Michael Jordan in an ad would make someone more likely to go to McDonald's, but they pay him a lot of money to create these associations.
One of the interesting things about this book is seeing an economist's perspective on issues of political psychology. Conversely, in discussing the views that political scientists and others have of economics, Caplan writes, "it is usually economists themselves who discover the exceptions [to 'market fundamentalism'] in the first place." Maybe it would be more accurate to write that some of these ideas are taken more seriously by economists, hence they take the trouble to note the exceptions. Scientists in other fields often would not even entertain "market fundamentalism" in the first place, so they don't bother to refute it. For example, when I told my psychology-professor friend about my ideas on rational voting, he wasn't particularly interested, because psychologists know all about how people are irrational. They don't see rationality as expected. I see rational-choice arguments as complementing rather than competing with psychological explanations of political behavior. Others have examined different ways in which such models are useful. For example, Caplan writes, "A worker could always offer to work for a reduced salary in exchange for more on-the-job safety," but Dorman (1996) argues convincingly that this does not actually happen.
Summary
It is too much to expect any player in the political system to be entirely rational--Ansolabehere and Snyder (2003) argue that even lobbyists are not particularly rational in their campaign contributions. Despite what is sometimes said, voting is not particularly irrational as compared to other social and political activities. Voting has low cost and a very small chance of making a difference, but in that unlikely event, the difference can have huge repercussions nationally and globally; hence, the expected return from voting is arguably on the same order of magnitude as its cost (see Parfit, 1984, and Edlin, Gelman, and Kaplan, 2007).
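To spell out the orders of magnitude (my own gloss, not a formula quoted from those sources): if the chance that one vote is decisive scales roughly like 1/N in an electorate of size N, while the social benefit of the better outcome scales like N times an average per-person benefit b, then the expected social return is roughly (1/N) * (N * b) = b, which does not shrink as the electorate grows and so can plausibly be of the same order as the small cost of voting.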
Much of the work on the rationality of voting focuses on the decision of whether to vote, and which candidates to vote for. Caplan usefully switches the focus to policy, and he does a good job at exploring the implications of the fact that people don't have an economic motivation for being good voters. Even when they are voting rationally (by their own views), the feedback mechanism for updating their preferences is pretty weak.
I'm not so convinced by Caplan's arguments in favor of the alternatives of rule by business or rule by educated elites. I think his main argument (theoretical and practical problems with democracy) can be separated from some of his more debatable stances.
References
Ansolabehere, S., and Snyder, J. (2003). Why is there so little money in U.S. politics? Journal of Economic Perspectives 17, 105-130.
Dorman, P. (1996). Markets and Mortality: Economics, Dangerous Work, and the Value of Human Life. Cambridge University Press.
Edlin, A., Gelman, A., and Kaplan, N. (2007). Voting as a rational choice: why and how people vote to improve the well-being of others. Rationality and Society 19, 293-314.
Hibbs, D. A. (2008). The implications of the "bread and peace" model for the 2008 US presidential election outcome. Public Choice 137, 1-10.
Huntington, S. P. (1975). The United States. In The Crisis of Democracy, ed. M. Crozier, S. Huntington, and J. Watanuki. New York University Press.
Lichtenstein, S., and Slovic, P., eds. (2006). The Construction of Preference. Cambridge University Press.
Parfit, D. (1984). Reasons and Persons. Oxford University Press.

5 Comments
Hopefully Anonymous December 28, 2009 12:58 PM
"I'm not so convinced by Caplan's arguments in favor of the alternatives of rule by business or rule by educated elites."
This posture is more cowardly than you've been lately in my opinion.
That's a very limited universe of alternatives; I've seen a lot more and better in various comment threads in the academic blogosphere.
Andrew Gelman December 28, 2009 3:17 PM
I'm not an expert on political theory and was not trying in my review to consider all the alternatives; my goal was to assess the arguments in Caplan's book. I'm sure that many in and out of the blogosphere can add a lot to the discussion beyond what I have to say. Caplan's book was based on public opinion research, which might be why I was asked to review it.
William Ockham December 28, 2009 4:16 PM
Perhaps there's room for a book called "The Myth of the Rational Economist". Caplan overlooks many possible answers to the conundrum at the center of his book. He says that the general public and economists disagree about basic economic facts, and therefore one of the two groups must be wrong. In reality, there are many other possible explanations. Both groups are made up of human beings, and therefore both groups might suffer from systematic bias. As just one example, Caplan states that receiving a free washing machine is just the same as getting downsized, because in both cases society conserves valuable labor. This is just about the stupidest, most biased thing I have read this week. To fail to see how involuntary unemployment can corrode the social fabric is a blindness far worse than anything he accuses the public of.
Robin Hanson replied to comment from Andrew Gelman December 28, 2009 4:40 PM
So you agree with pretty much everything Bryan says, except on topics outside what you consider your area of expertise, where you don't want to offer reasons for your disagreement? Seems you agree just about as much as Bryan could possibly hope for. :)
Andrew Gelman December 28, 2009 4:57 PM
William:
Yes, I think this was one thing I was getting at in my review, that Caplan at times jumps from statements about public opinion to his personal political attitudes.
Robin:
I hope that Bryan is happy with the review; I think he did an excellent job in his book! But I do think it's useful to point out areas where he is extrapolating beyond his evidence. If you read my review carefully, you'll see that I do offer reasons for my disagreements with several specific points in his book.
When I write that "his main argument (theoretical and practical problems with democracy) can be separated from some of his more debatable stances," this is intended to be a positive statement. What I'm saying is that, even if you disagree with some of Caplan's strongly-held political views, this shouldn't cause you to dismiss his main argument. (To put it in logical notation, if Caplan's main argument is A and his political views are B, then he does not demonstrate that A -> B. Thus, disbelief in B is not a reason to disbelieve A.)

Friday, December 18, 2009

My Five Rules for Data Visualization
By Drew Conway, on December 3rd, 2009
Tonight the NYC R Meetup will be discussing data visualization in R using ggplot2. As part of tonight’s meeting I will be providing a very brief show and tell, which includes mostly code examples and external resources. This exercise has had me thinking quite a bit about data visualization. In addition, a few days ago the Security Crank (great new blog) pinged me on the apparent uselessness of network analysis visualizations in the defense and intelligence communities. As I say in my comment at SC, I agree, but only in the sense that the method is abused by those who view it as merely a means to generate “pretty pictures.” All of this touches on a very important point about data analysis, possibly the most important: how best to convey an analysis visually.
Consumers of data analytics are very rarely analysts themselves, so those in the business of generating plots, figures, charts, graphs, etc. must not only be expert in the analytical process, but also in choosing the best format and medium for relaying that knowledge to an audience. Admittedly, I am not Edward Tufte, Ben Fry, or David McCandless, but I have been around long enough to know what does and does not work, and as such here (in no particular order) are my five rules for data visualization.
1. The viz must be able to stand alone
This I learned early, after being dressed down multiple times while giving briefings to senior intelligence officers. Since then it has been reinforced while sitting in on failed job talks and conference presentations. The important thing to keep in mind is that when an audience sees a visualization it should be providing answers, not generating more questions.
This, to me, is the most difficult aspect of creating high quality data visualizations. As the creators we are often intimately familiar with the data, and thus take its subtleties for granted. Some people recommend asking yourself “would my Grandmother understand this,” but why insult Grandma’s intelligence? Here’s the bottom line: you have to decide the most efficient means of plotting the data (we’ll get to this), then you have a chart title, legend, possibly some axis labels, and if you are bold a short (140 characters is a good limit) footnote to get your point across. The best visualizations only require a subset of these to be effective, but once you have added the appropriate data accoutrements the chart had better be self-explanatory. Very simple and imperfect example: restaurant tipping trends between men and women.
[Figure: tipping data plotted once with color distinguishing sex (left) vs. split into separate panels for men and women (right)]
Why is the chart on the right better? First, it has more explanatory value. By splitting the data into two parts we are able to see the x-axis shift for men, i.e., in general they are tipping on higher bills. Also, we are able to use color in a more valuable way; rather than using it to distinguish between sex we can use it to highlight outliers and note general trends. Next, by reducing the amount of data in each plot the information is conveyed more efficiently. Finally, it achieves our ultimate goal, which is always to provide more answers than questions.
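For readers who want to reproduce this kind of split, here is a minimal sketch in R. The data behind the charts above isn’t posted, so this uses the classic restaurant tips dataset that ships with the reshape2 package as a stand-in:

library(ggplot2)
data(tips, package = "reshape2")   # stand-in tipping data: total_bill, tip, sex
ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point() +
  facet_wrap(~ sex)   # one panel per sex; the x-axis shift for men stays visible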
2. Have a diverse tool set
Learning the quirks and syntax of various data visualization tools is time consuming and often frustrating, but if you want to create impressive charts you have to do it. I am very sorry to report that Microsoft Excel + PowerPoint do not generate the best data visualizations. In fact, they often generate visualizations in the 10-20th percentile of quality. The question, therefore, is: how do you find the best tools for your task?
Most of us will not have the resources to use professional data visualization suites, but even so, these tools are often limited by the scope and vision of their creators. Explore the open-source and general-purpose data visualization options out there, learn the three best that fit your needs, and always be open to learning the new stuff--it will pay off.
3. People are terrible at distinguishing small differences
This could also be described as the “pie chart trap,” but clearly goes beyond that particular chart design. In fact, network visualizations are notorious for blurring subtle differences. For example, visualizations of massive amounts of social network data can be beautiful, but in nearly all cases they are much more art than science. If we are interested in telling a story with our data, and our data is large and complex, then we need to be creative about how to parse that complexity in order to enhance the clarity of our story. Example using networks: the structure of venture capital co-investments.
[Figure: two network visualizations of the same venture capital co-investment data]
The visualizations above examine the same data, and even use a similar technique to visualize it, but clearly the example on the right is conveying a more informative story. Admittedly, this visualization, which I generated, in many ways violates my first rule; however, it is still telling a story (e.g., there is a strong underlying structure among four notable communities of VC firms). The visualization on the left, taken from an initial attempt at analyzing this data, tells almost no story; save that the network is highly complex and there exist some disconnected firms.
4. Color selection matters
This would seem to be a self-evident point, but it may be the most often violated rule of quality visualization. It seems the primary reason for this problem is laziness, as the default color schemes in many visualization packages were not designed to convey information (again, see the left panel of the figure above). I recently violated this rule while putting together the slides for tonight’s R meetup. Using a single line of R code I generated this chart:
[Figure: stacked bar chart of whiskey brands by type, drawn with ggplot2’s default fill colors]
library(ggplot2)
data(whiskey, package = "flexmix")   # loads both 'whiskey' and 'whiskey_brands'
ggplot(subset(whiskey_brands, Brand != "Other brands"),
       aes(x = Type, fill = Brand)) +
  geom_bar(position = "fill")
In my defense, I was first excited that there was a built-in Scotch whiskey dataset in R, but I also wanted to show what could be done with a single line of code. Clearly, however, the color scheme I used is taking away from the story. The default color scheme in ggplot2 wants to use a gradient, which may be useful in some cases, but not here. To improve the above example I should override this default and construct a more informative color scheme, such as setting a base color for each Scotch type (e.g., blue for blends and green for single malts).
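One way to implement that suggestion is with scale_fill_manual(). This is a sketch under two assumptions: that whiskey_brands has one row per brand, and that its Type labels contain the word “blend” for blended Scotches; the hues themselves are arbitrary:

library(ggplot2)
data(whiskey, package = "flexmix")
brands <- subset(whiskey_brands, Brand != "Other brands")
brands$Brand <- factor(as.character(brands$Brand))            # drop the unused level
is_blend <- grepl("blend", brands$Type, ignore.case = TRUE)   # assumption about Type labels
pal <- c(setNames(colorRampPalette(c("navy", "skyblue"))(sum(is_blend)),
                  as.character(brands$Brand[is_blend])),
         setNames(colorRampPalette(c("darkgreen", "palegreen"))(sum(!is_blend)),
                  as.character(brands$Brand[!is_blend])))
ggplot(brands, aes(x = Type, fill = Brand)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = pal)   # blues for blends, greens for single malts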
5. Reduce, reuse, recycle
When developing statistical models we are often striving to specify the most “parsimonious” model, that is, the model that has the highest explanatory value-to-required variables ratio. We do this to reduce waste in our models, enhance our degrees of freedom, and provide a model that is most relevant to the data. The exact same rules apply to visualizations. Not all observations are created equal; therefore, they may not all belong in a visualization. Those who are analyzing large datasets take data reduction (or “munging”) as a given, but in any visualization, if something is not adding any value, take it out. Developing new and meaningful methods for reducing data is a serious challenge, but one that should be considered before any attempt at visualization is made.
On the other hand, if a reduction and/or visualization method has been successful in the past then it will likely be successful in the future, so do not be afraid to reuse and recycle. Many of the most successful data visualizers have distinguished themselves by creating a method for visualization and sticking with it (think Gapminder). Not only might it make you famous, but putting in the effort to create a useful method for combining, reducing and visualizing data will mean your efforts are more streamlined in the long term.
So that’s it. Nothing too profound there, but I wanted to post this in order to start a conversation. In that vein, what did I miss and where do you disagree? As always, I welcome your comments.

Must-Have R Packages for Social Scientists

After recently having to think critically about the value of various R packages for social science research, I realized that others might find value in a post on “must-have” R packages for social scientists. After the immensely popular post on this topic for Python packages, a follow-up seemed appropriate. If you conduct social science research but are desperately clinging to your SAS, SPSS or Matlab licenses, waiting for someone to convince you of R’s value, please allow me to be the first to try.
R is a functional programming language that allows for seamless data exploration, manipulation, analysis and visualization. The community using and supporting the language has exploded over the last several years, which has led to the development of several immensely useful packages, many of which have direct application in the social sciences. Below are the R packages I use on a weekly/daily/monthly basis (in no particular order) and highly recommend to any R user, new or old.
Zelig
Put simply, Zelig is a one-stop statistical shop for nearly all regression model specifications. Using a uniform syntax across model types, and several extremely useful plotting functions, the package’s author Gary King (Political Science and Statistics at Harvard University) calls Zelig “everyone’s statistical software,” which is a very accurate description. If there is one R package that every social scientist should have, it is Zelig!
Download Zelig
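As a taste of that uniform syntax, here is a minimal sketch of the standard Zelig workflow (estimate, set covariates, simulate), using the turnout dataset that ships with the package:

library(Zelig)
data(turnout)
z.out <- zelig(vote ~ age + educate, model = "logit", data = turnout)
x.out <- setx(z.out, educate = 12)   # fix covariates at values of interest
s.out <- sim(z.out, x = x.out)       # simulate quantities of interest
summary(s.out)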
ggplot2
One of the advantages of R as a functional language is that it contains a set of convenient base functions for plotting data. While useful when exploring a dataset, they are--for lack of a better word--ugly, and this is where ggplot2 comes in. Using the Grammar of Graphics manifesto as a guide, creator Hadley Wickham designed ggplot2 to “take the good parts of base and lattice graphics and none of the bad parts,” and he succeeded. This is the premier R package for conveying your analysis visually.
Download ggplot2
Statnet/igraph
I have combined the two competing network analysis packages in R into a single bullet because each has its strengths and weaknesses, and as such there is value in learning and using both. The igraph package approaches network analysis from the mathematics/physics/graph-theoretic perspective, including several advanced metrics and random graph models. In contrast, Statnet was primarily designed for social science, and its primary advantage is the inclusion of a series of functions for estimating and testing ERGM/p* graph models.
Download igraph / Download Statnet
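A tiny sketch of the igraph side, computing two standard graph-theoretic metrics on a random graph (the statnet/ergm workflow is analogous but model-based):

library(igraph)
set.seed(1)
g <- sample_gnp(50, 0.1)   # Erdos-Renyi random graph on 50 nodes
mean(degree(g))            # average degree
transitivity(g)            # global clustering coefficient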
plyr
Also brought to you by R guru Hadley Wickham, the plyr package assists researchers in the least glamorous aspect of their work--data manipulation and cleaning. One of R’s great advantages is its ability to handle very large datasets, and plyr is there to help you break these large data problems into smaller and more manageable pieces.
Download plyr
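A minimal sketch of plyr’s split-apply-combine idiom, on a made-up data frame (the column names here are purely illustrative):

library(plyr)
d <- data.frame(region = rep(c("North", "South"), each = 5),
                income = rnorm(10, mean = 50, sd = 10))
ddply(d, .(region), summarise,
      mean_income = mean(income),    # per-group summary statistic
      n           = length(income))  # per-group count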
Amelia II
Also developed by Gary King, Amelia II contains a set of algorithms for multiple imputation of missing data across a wide range of data types, such as survey, time-series and cross-sectional. As missing data problems are ubiquitous in social science research, the functions contained in this package provide a powerful solution to these issues.
Download Amelia II
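A minimal sketch, using the freetrade example dataset that ships with Amelia:

library(Amelia)
data(freetrade)
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country")  # 5 imputed datasets
summary(a.out)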
nlme
This package is used to fit and compare Gaussian linear and nonlinear mixed-effects models. For those examining complex time series data with various correlation structures, the nlme package provides a number of options for fits, tests and plotting.
Download nlme
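A minimal sketch, fitting a random-intercept model to the Orthodont growth-curve data that ships with nlme:

library(nlme)
data(Orthodont)
fm <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)
summary(fm)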
SNOW/Rmpi
Unlike newer versions of Python, the current build of R does not contain native functionality for distributing jobs across high-performance computing clusters. The SNOW and Rmpi packages provide this functionality, and are highly recommended to any researcher with access to an HPC environment running R.
Download SNOW / Download Rmpi
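A minimal SNOW sketch: spin up a socket cluster on the local machine and farm out a trivially parallel job (on a real HPC cluster the makeCluster() call would point at the scheduler’s node list):

library(snow)
cl <- makeCluster(4, type = "SOCK")         # four local worker processes
res <- parSapply(cl, 1:8, function(i) i^2)  # embarrassingly parallel map
stopCluster(cl)
res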
xtable/apsrtable
Both of these packages convert R summary results into LaTeX/HTML table format. The xtable package is a general solution, while the apsrtable package, developed by fellow political science grad student Michael Malecki, will output tables in the APSR format, for those of you fortunate enough to need to use this format.
Download xtable / Download apsrtable
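A minimal xtable sketch, turning a fitted regression into a LaTeX table (the apsrtable call is analogous):

library(xtable)
fit <- lm(dist ~ speed, data = cars)  # built-in 'cars' dataset
print(xtable(fit))                    # emits a LaTeX coefficient table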
plm
Got panel data? If so, you need plm, which contains all of the necessary model specifications and tests for fitting a panel data model, including specifications for instrumental variable models.
Download plm
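A minimal sketch, fitting a fixed-effects (“within”) model to the Grunfeld investment data that ships with plm:

library(plm)
data("Grunfeld", package = "plm")
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")  # firm fixed effects
summary(fe)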
sqldf
As I stated, R is great for dealing with large datasets; however, occasionally you will encounter a dataset so large that it can grind R’s base I/O functions to a halt. As the name suggests, the sqldf package overcomes this by allowing users to perform SQL statements directly on R data frames, greatly increasing efficiency.
Download sqldf
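A minimal sketch of querying a data frame with SQL (the data frame here is made up for illustration):

library(sqldf)
d <- data.frame(state = c("NY", "CA", "TX"), pop = c(19, 37, 25))
sqldf("SELECT state, pop FROM d WHERE pop > 20 ORDER BY pop DESC")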
I hope that you will explore and use the packages above that you do not already have familiarity with. To those who have never used R and/or have an irrational phobia of the language, let this list provide the appropriate motivation. Also, to those R experts out there, I welcome any suggestions for more useful R packages for the social science inclined!


Fake-data simulation as a research tool (Andrew Gelman)

I received the following email:
I was hoping if you could take a moment to counsel me on a problem that I'm having trying to calculate correct confidence intervals (I'm actually using a bootstrap method to simulate 95%CIs). . . . [What follows is a one-page description of where the data came from and the method that was used.]
My reply:
Without following all the details, let me make a quick suggestion which is that you try simulating your entire procedure on a fake dataset in which you know the "true" answer. You can then run your procedure and see if it works there. This won't prove anything but it will be a way of catching big problems, and it should also be helpful as a convincer to others.
If you want to carry this idea further, try to "break" your method by coming up with fake data that causes your procedure to give bad answers. This sort of simulation-and-exploration can be the first step in a deeper understanding of your method.
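To make the suggestion concrete, here is a toy fake-data sketch in R (my own construction, not the correspondent’s actual procedure): simulate data with a known “truth,” run the estimator, and check whether the 95% interval covers that truth about 95% of the time.

set.seed(1)
true_beta <- 2
covered <- replicate(1000, {
  x <- rnorm(100)
  y <- true_beta * x + rnorm(100)           # data generated from a known truth
  ci <- confint(lm(y ~ x))["x", ]           # 95% interval for the slope
  ci[1] <= true_beta && true_beta <= ci[2]
})
mean(covered)   # should be near 0.95 if the procedure is calibrated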
And then I got another, unrelated email from somebody else:
I am working on a mixed treatment comparison of treatments for non-small cell lung cancer. I am doing the analysis in two parts in order to estimate treatment effects (i.e. log hazard ratios) and absolute effects (by projecting the log hazard ratios onto a baseline treatment scale parameter; the baseline treatment times to event are assumed to arise from a Weibull distribution. . . . [What follows is a one-page description of the model, which was somewhat complicated by constraints on some of the variance parameters] . . . I can get my analysis to run with constraints imposed on the treatment specific prior distributions for PFS and OS, and on the population log hazard ratios for PFS and OS. However, my problem is that the constraint does not appear to be doing anything and the results are similar to what I obtain without imposing the constraint. This is not what I expect . . .
My reply:
Sometimes the data are strong enough that essentially no information is supplied by external constraints. You can, to some extent, check how important this is for your problem by simulating some fake data from a setting similar to yours and then seeing whether your method comes close to reproducing the known truth. You can look at point estimates and also the coverage of posterior intervals.

Wednesday, December 16, 2009

According to Microsoft, the fourth paradigm of science is data?

In scientific discovery, the first three paradigms were experimental, theoretical and (more recently) computational science. A new book of essays published by Microsoft (and available for free download -- kudos, MS!) argues that a fourth paradigm of scientific discovery is at hand: the analysis of massive data sets. The book is dedicated to the late Microsoft researcher Dr Jim Gray, who pioneered the idea with the catchphrase: "It's the data, stupid". The basic idea is that our capacity for collecting scientific data has far outstripped our present capacity to analyze it, and so our focus should be on developing technologies that will make sense of this "Deluge of Data" (as this New York Times review of the book -- well worth a read -- calls it).
Dr Gray's call-to-arms was not to develop isolated super-powerful super-computers but “to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.” This dream is already close to a reality in some scientific domains like astronomy, where advanced instruments routinely generate petabytes of data available for public analysis. And with further developments in distributed and high-performance computing, with freely-available high-scale data management tools like Hadoop, and with advanced open-source data-analysis tools like R rapidly adapting to the scales of these data sets, the fourth paradigm is certain to become a mainstream reality in other scientific domains as well.
Microsoft Research: The Fourth Paradigm: Data-Intensive Scientific Discovery

Draw a scatterplot of age vs. attractiveness, using gender to define the points' colors. (That is, color encodes a variable beyond the two plotted on the axes.)

R code (assuming a data frame d with numeric columns age and attractive and a logical column male):
plot(d$age, d$attractive, col = ifelse(d$male, "blue", "deeppink"))
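A self-contained version with a legend; the data frame d is fabricated here purely for illustration:

set.seed(1)
d <- data.frame(age = runif(100, 18, 60),
                attractive = rnorm(100, mean = 5, sd = 2),
                male = sample(c(TRUE, FALSE), 100, replace = TRUE))
plot(d$age, d$attractive,
     col = ifelse(d$male, "blue", "deeppink"),
     xlab = "Age", ylab = "Attractiveness")
legend("topright", legend = c("Male", "Female"),
       col = c("blue", "deeppink"), pch = 1)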


Statistics is big-N logic?

I think I believe one of these things, but I’m not quite sure.
Statistics is just like logic, except with uncertainty.
This would be true if statistics is Bayesian statistics and you buy the Bayesian inductive logic story — add induction to propositional logic, via a conditional credibility operator, and the Cox axioms imply standard probability theory as a consequence. (That is, probability theory is logic with uncertainty. And then a good Bayesian thinks probability theory and statistics are the same.) Links: Jaynes’ explanation; SEP article; also Fitelson’s article. (Though there are negative results; all I can think of right now is a Halpern article on Cox; and also interesting is Halpern and Koller.)
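(For concreteness, the Cox-style result can be stated roughly as follows, in my paraphrase rather than a quote from any of those sources: any conditional credibility measure cr(A|B) satisfying Cox's consistency desiderata can be rescaled into a function P obeying the familiar rules P(not-A | B) = 1 - P(A | B) and P(A and B | C) = P(A | B and C) * P(B | C) -- that is, into a probability.)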
Secondly, here is another statement.
Statistics is just like logic, except with a big N.
This is a more data-driven view — the world is full of things and they need to be described. Logical rules can help you describe things, but you also have to deal with averages, correlations, and other things based on counting.
I don’t have any fancy cites or much thought yet in to this.
Here are two other views I’ve seen…
Johan van Benthem: probability theory is “logic with numbers”. I only saw this mentioned in passing in a subtitle of some lecture notes; this is not his official position or anything. Multi-valued and fuzzy logics can fit this description too. (Is fuzzy logic statistical? I don’t know much about it, other than that the Bayesians claim a weakness of fuzzy logic is that it doesn’t naturally relate to statistics.)
Manning and Schütze: statistics has to do with counting. (In one of the intro chapters of FSNLP). Statistics-as-counting seems more intriguing than statistics-as-aggregate-randomness.
Not sure how all these different possibilities combine or interact.

2 comments to “Statistics is big-N logic?”
Shawn wrote: 26 March 2007 at 4:00 am:
Could you explain what “big N” means to those of us who are statistically ignorant (e.g. me)? Your statement is amazingly opaque to me as is.
Brendan wrote: 27 March 2007 at 6:14 am:
Oh, I’m not sure it means much to anyone besides me :) Whenever there’s an experiment or a study, “N” often refers to the total number of data points (number of subjects, trials, samples, etc.). If you’re building an intelligent system, there’s a difference between having to support learning and inference over big datasets of many different things and over just small numbers of things. There is quite a history of sophisticated logical systems that fail to scale to real-world data, while ludicrously simple statistical systems can be surprisingly robust.
The idea is that logical formalisms — boolean algebras, relations, functions, or frames and the like — are good at describing complex structure and relations for small sets of things. But if you have lots of things, you need to introduce abstractions involving counting — averages, covariance, correlations, and the like. The even more vague idea is that, given a nice axiomatic formalism, perhaps there is a way in which statistical notions result from having to consider learning/inference over large quantities of things.

There are approaches to probability theory (Cox/Jaynes as I understand it) that demonstrate it as a consequence of adding induction/uncertainty to boolean logic… I was wondering if there might be something analogous for statistics.

Tuesday, December 15, 2009

Handy statistical lexicon

By Andrew Gelman on May 24, 2009 10:29 PM
These are all important methods and concepts related to statistics that are not as well known as they should be. I hope that by giving them names, we will make the ideas more accessible to people:
Mister P: Multilevel regression and poststratification.
The Secret Weapon: Fitting a statistical model repeatedly on several different datasets and then displaying all these estimates together.
The Superplot: Line plot of estimates in an interaction, with circles showing group sizes and a line showing the regression of the aggregate averages.
The Folk Theorem: When you have computational problems, often there's a problem with your model.
The Pinch-Hitter Syndrome: People whose job it is to do just one thing are not always so good at that one thing.
Weakly Informative Priors: What you should be doing when you think you want to use noninformative priors.
P-values and U-values: They're different.
Conservatism: In statistics, the desire to use methods that have been used before.
WWJD: What I think of when I'm stuck on an applied statistics problem.
Theoretical and Applied Statisticians, how to tell them apart: A theoretical statistician calls the data x, an applied statistician says y.
The Fallacy of the One-Sided Bet: Pascal's wager, lottery tickets, and the rest.
Alabama First: Howard Wainer's term for the common error of plotting in alphabetical order rather than based on some more informative variable.
The USA Today Fallacy: Counting all states (or countries) equally, forgetting that many more people live in larger jurisdictions, and so you're ignoring millions and millions of Californians if you give their state the same space you give Montana and Delaware.
Second-Order Availability Bias: Generalizing from correlations you see in your personal experience to correlations in the population.
The "All Else Equal" Fallacy: Assuming that everything else is held constant, even when it's not gonna be.
The Self-Cleaning Oven: A good package should contain the means of its own testing.
The Taxonomy of Confusion: What to do when you're stuck.
The Blessing of Dimensionality: It's good to have more data, even if you label this additional information as "dimensions" rather than "data points."
Scaffolding: Understanding your model by comparing it to related models.
Ockhamite Tendencies: The irritating habit of trying to get other people to use oversimplified models.
Bayesian: A statistician who uses Bayesian inference for all problems even when it is inappropriate. I am a Bayesian statistician myself.
Multiple Comparisons: Generally not an issue if you're doing things right but can be a big problem if you sloppily model hierarchical structures non-hierarchically.
Taking a model too seriously: Really just another way of not taking it seriously at all.
God is in every leaf of every tree: No problem is too small or too trivial if we really do something about it.
I know there are a bunch I'm forgetting; can you all refresh my memory, please? Thanks.
P.S. No, I don't think I can ever match Stephen Senn in the definitions game.

3 Comments
marcel May 25, 2009 4:54 PM
In WWJD, you say, "My quick answer is, Yeah, I think it would be excellent for an econometrics class if the students have applied interests. Probably I'd just go through chapter 10 (regression, logistic regression, glm, causal inference), with the later parts being optimal."
So just skip the earlier parts?
Andrew Gelman May 25, 2009 7:23 PM
Marcel: When I say "through chapter 10," I mean, "from chapters 1 through 10." And in the last sentence above, I meant "optional," not "optimal." I'll fix that.
jonathan May 26, 2009 11:22 AM
Mister P, huh? Isn't that reflective of the old male dominant paradigm?

Saturday, December 12, 2009

Are Liberals Smarter Than Conservatives?

Posted on: December 5, 2009 4:44 AM, by Andrew Gelman
Tom Ball writes:
Didn't know if you had seen this article [by Jason Richwine] about political allegiance and IQ but wanted to make sure you did. I'm surprised the author hasn't heard or seen of your work on Red and Blue states! What do you think?
I think the article raises some interesting issues but he seems to be undecided about whether to take the line that intelligent Americans mostly have conservative views ("[George W.] Bush's IQ is at least as high as John Kerry's" and "Even among the nation's smartest people, liberal elites could easily be in the minority politically") or the fallback position that, yes, maybe liberals are more intelligent than conservatives, but intelligence isn't such a good thing anyway ("The smartest people do not necessarily make the best political choices. William F. Buckley once famously declared that he would rather give control of our government to "the first 400 people listed in the Boston telephone directory than to the faculty of Harvard University."). One weakness of this latter argument is that the authorities he relies on for this point--William F. Buckley, Irving Kristol, etc.--were famous for being superintelligent. Richwine is in the awkward position of arguing that Saul Bellow's aunt (?) was more politically astute than Bellow, even though, in Kristol's words, "Saul's aunt may not have been a brilliant intellectual." Huh? We're taking Richwine's testimony on Saul Bellow's aunt's intelligence?
Richwine also gets into a tight spot when he associates conservatism with "following tradition" and liberalism with "non-traditional ideas." What is "traditional" can depend on your social setting. What it takes to be a rebel at the Columbia University faculty club is not necessarily what will get you thrown out of a country club in the Dallas suburbs. I think this might be what Tom Ball was thinking about when he referred to Red State, Blue State: political and cultural divisions mean different things in different places.
I do, however, agree with Richwine's general conclusion, which is that you're probably not going to learn much by comparing average IQs of different groups. As Richwine writes, "The bottom line is that a political debate will never be resolved by measuring the IQs of groups on each side of the issue." African-Americans have low IQs, on average, Jews have high IQs on average, and both groups vote for the Democrats. Latinos have many socially conservative views but generally don't let those views get in the way of voting for Democrats.
Comments
For trivial amusement, people can play with the GSS:
* POLVIEWS vs. WORDSUM.
* REGION vs. POLVIEWS; REGION vs. WORDSUM, filter POLVIEWS(4).
* POLVIEWS(1-3,5-7) vs. WORDSUM.
A more subtle question involves that quote about "the smartest people do not necessarily make the best political choices." Is there a difference in scope and social distribution of the fallout for the types of mistakes each group makes? Which distribution is more harmful to society?
Andrew Gelman: Richwine also gets into a tight spot when he associates conservatism with "following tradition" and liberalism with "non-traditional ideas."
As opposed, say, to identifying conservatives as those who also use INGROUP, AUTHORITY, and PURITY as bases for moral intuitions. Essentially, Richwine looks to be trying to disown the Religious Right portion of conservatism so that the whole of conservatism's big-tent membership (such as free market fans) does not get tainted by association.
Posted by: abb3w December 5, 2009 1:47 PM

Thursday, December 10, 2009

Linear Regression

February 28, 2009
Definition:
Linear regression is a statistical tool used for forecasting future prices. The concept behind linear regression is to find the best estimate of the trend given a noisy sample of data points. It is calculated by using the “Least Squares” method over a given period, and the result is drawn as a trendline extending through the defined period that attempts to filter out market noise.
You can add a Linear Regression Channel, which forms lines above and below the linear regression to help identify support and resistance. An LR Channel can be added by clicking “Utilities” >> “Parameters” and selecting “LR Channel” in the resulting pop-up window. The default LR Channel is one standard deviation above and below the linear regression.
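For concreteness, here is a minimal sketch in R of a least-squares trendline with a one-standard-deviation channel, on simulated prices (not tied to any particular charting package):

set.seed(1)
n <- 100
price <- 50 + cumsum(rnorm(n))   # random-walk stand-in for closing prices
period <- seq_len(n)
fit <- lm(price ~ period)        # least-squares trendline
s <- sd(residuals(fit))          # channel half-width: one standard deviation
plot(period, price, type = "l", xlab = "Period", ylab = "Price")
abline(fit, col = "blue")                                     # regression line
abline(coef(fit)[1] + s, coef(fit)[2], col = "red", lty = 2)  # upper channel
abline(coef(fit)[1] - s, coef(fit)[2], col = "red", lty = 2)  # lower channel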
Interpretation:
There are two conventional interpretations for the linear regression line.
The first interpretation is to use the linear regression as the overall trendline for that given period. If the line slopes upward, it may suggest a buying opportunity, whereas a turn downwards suggests one may consider selling the stock. Price divergences below the line indicate a possible buying opportunity, for the market is oversold, while divergences above the line indicate the market is potentially overbought. Linear regression will work best when the period being studied is similar to the cycle length or typical trend length of the security in question.
A second interpretation is to construct a linear regression channel, consisting of two parallel lines at fixed distances above and below the linear regression line. These lines can be used as support and resistance lines, which are used to watch the battle between buyers and sellers.
Support and resistance lines are drawn as the upper and lower limits of a trading range, whereby the support line is the bottom line, below which the “bulls” will not let the price fall, and the resistance line is the top line, above which the “bears” will not let the price rise.
Conventionally, a breakout above resistance or below support indicates that either a) there is some news about the company which justifies recreating the upper and lower trading limits, or b) there is about to be a correction towards the range as traders are hesitant about the stock’s new value.
Using the Linear Regression Channel can assist in finding support and resistance levels from the Linear Regression.

Written by Larry Swing · Filed Under Indicators, Oscillators and Overlays