Planet R

Interestingness comparisons

at by Ron Pearson (aka TheNoodleDoodler) on R-Bloggers

(This article was first published on ExploringDataBlog, and kindly contributed to R-bloggers)

In three previous posts (April 3, 2011,  April 12, 2011,and May 21, 2011), I have discussed interestingness measures, which characterize the distributional heterogeneity of categorical variables.  Four specific measures are discussed in Chapter 3 of Exploring Data in Engineering, the Sciences and Medicine: the Bray measure, the Gini measure, the Shannon measure, and the Simpson measure.  All four of these measures vary from 0 to 1 in value, exhibiting their minimum value when all levels of the variable are equally represented, and exhibiting their maximum value when the variable is completely concentrated on a single one of its several possible levels.  Intermediate values correspond to variables that are more or less homogeneously distributed: more homogeneous for smaller values of the measure, and less homogeneous for larger values.  One of the points I noted in my first post on this topic was that the different measures exhibit different behavior for the intermediate cases, reflecting different inherent sensitivities to the various ways in which a variable can be “more homogeneous” or “less homogeneous.”  This post examines changes in interestingness measures as a potential exploratory analysis tool for selecting categorical predictors of some binary response.  In fact, I examined the same question from a different perspective in my April 12 post noted above: the primary difference is that there, the characterization I considered generates a single graph for each variable, with the number of points on the graph corresponding to the number of levels of the variable.  Here, I examine a characterization that represents each variable as a single point on the graph, allowing us to consider all variables simultaneously.



As a reminder of how these measures behave, the figure above shows a plot of the normalized Gini measure versus the normalized Shannon measure for the 23 categorical variables included in the mushroom dataset from the UCI Machine Learning Repository.  As I have noted in several previous posts that have discussed  this dataset, it gives observable characteristics for 8,124 mushrooms and classifies each one as either edible or poisonous (the binary variable EorP).  The above plot illustrates the systematic difference between the normalized Shannon and Gini interestingness measures: there, each point represents one of the 23 variables in the dataset, with the horizontal axis representing the Shannon measure computed for the variable and the vertical axis rperesenting the corresponding Gini measure.  The plot shows that the Gini measure is consistently larger than the Shannon measure, since all points lie above the equality reference line in this plot except for the single point at the origin.  This point corresponds to the variable VeilType, which only exhibits a single value in this dataset, meaning that both the Gini and Shannon measures are inherently ill-defined; consequently, they are given the default value of zero here, consistent with the general interpretation of these measures: if a variable only assumes a single value, it seems reasonable to consider it “completely homogeneous.”

Because edible and poisonous mushrooms are fairly evenly represented in this dataset (51.8% edible versus 48.2% poisonous), it has been widely used as one of several benchmarks for evaluating classification algorithms.  In particular, given the other mushroom characteristics, the fundamental classification question is how well can we predict whether each mushroom is poisonous or edible.  In this post and a subsequent follow-up post, I consider a closely related question: can differences in a variable’s interestingness measure between the edible subset and the poisonous subset be used to help us select prediction covariates for these classification algorithms?  In this post, I present some preliminary evidence to suggest that this may be the case, while in a subsequent post, I will put the question to the test by seeing how well the covariates suggested by this analysis actually predict edibility.

The specific idea I examine here is the following: given an interestingness measure and a mushroom characteristic, compute this measure for the chosen characteristic, applied the edible and poisonous mushrooms separately.  If these numbers are very different, this suggests that the distribution of levels is different for edible and poisonous mushrooms, further suggesting that this variable may be a useful predictor of edibility.  To turn this idea into a data analysis tool, it is necessary to define what we mean by “very different,” and this can be done in more than one way.  Here, I consider two possibilities.  The first is what I call the “normalized difference,” defined as the difference of the two interestingness measures divided by their sum.  Since both interestingness measures lie between 0 and 1, it is not difficult to show that this normalized difference lies between -1 and +1.  As a specific application of this idea, consider the plot below, which shows the normalized difference in the Gini measure between the poisonous mushrooms and the edible mushrooms (the normalized Gini shift) plotted against the corresponding difference for the Shannon measure (the normalized Shannon shift).  In addition, this plot shows an equality reference line, and the fact that the points consistently lie between this line and the horizontal axis shows that the normalized Gini shift is consistently smaller in magnitude than the normalized Shannon shift.  This suggests that the normalized Shannon measure may be more sensitive to distributional differences between edible and poisonous mushrooms.



The next figure, below, shows a re-drawn version of the above plot, with the equality reference line removed and replaced by four other reference lines.  The vertical dashed lines correspond to the outlier detection limits obtained by the Hampel identifier with threshold value t = 2 (see Chapter 7 of Exploring Data for a detailed discussion of this procedure), computed from the normalized Shannon shift values, while the horizontal dashed lines represent the corresponding limits computed from the normalized Gini shift values.  Points falling outside these limits represent variables whose changes in both Gini measure and Shannon measure are “unusually large” according to the Hampel identifier criteria used here.  These points are represented as solid circles, while those not detected as “unusual” by the Hampel identifier are represented as open circles.  The idea proposed here – to be investigated in a future post – is that these outlying variables may be useful in predicting mushroom edibility.



More specifically, the five solid circles in the above plot correspond to the following mushroom characteristics.  The two points in the lower left corner of the plot – exhibiting almost the most negative normalized Shannon shift possible – correspond to GillSize and StalkShape, two binary variables.  As I discussed in a previous post (May 7, 2011) and I discuss further in Chapter 13 of Exploring Data, an extremely useful measure of association between two binary variables (e.g., between GillSize and edibility) is the odds ratio.  An examination of the odds ratios for these two variables suggest that both should be at least somewhat predictive of edibility: the odds ratio between GillSize and edibility is 0.056, suggesting a very strong association (specifically, a GillSize value of “n” for “narrow” is most commonly associated with poisonous mushrooms in the UCI mushroom dataset), while the odds ratio between StalkShape and edibility is less extreme at 1.511, but still different enough from the neutral value of 1 to be suggestive of a clear association between these variables (a StalkShape value of “t” is more strongly associated with edible mushrooms than the alternative value of “e”).  The solid circle in the upper right of this plot corresponds to the variable CapSurf, which has four levels and whose distributional homogeneity appears to change quite substantially, according to both the Gini and Shannon measures.  Because this variable has more than two levels, it is not possible to characterize its association in terms of its odds ratio relative to edibility.  Finally, the cluster of three points in the upper right, just barely above the upper horizontal dashed line, correspond to the binary variables Bruises and GillSpace, and the six-level variable Pop.  Both of these binary variables exhibit very large odds ratios with respect to edibility (9.97 and 13.55 for Bruises and GillSpace, respectively), again suggesting that these variables may be highly predictive of edibility. 

The prevalence of binary variables in these results is noteworthy, and it reflects the fact that distributional shifts for binary variables can only occur in one way (i.e., the relative frequency of either fixed level can either increase or decrease).  Thus, large shifts in either interestingness measure should correspond to significant odds ratios with respect to the binary response variable, and this is seen to be the case here.  The situation is more complicated when a variable exhibits more than two levels, since the distribution of these levels can change in many ways between the two binary response values.  An important advantage of techniques like the the interestingness shift analysis described here is that they are not restricted to binary characteristics, as odds ratio characterizations are.

The second approach I consider for measuring the shift in interestingness between edible and poisonous mushrooms is what I call the “marginal measure,” corresponding to the difference in either the Gini or the Shannon measure between poisonous and edible mushrooms, divided by the original measure for the complete dataset.  An important difference between the marginal measure and the normalized measure is that the marginal measure is not bounded to lie between -1 and +1, as is evident in the plot below.  This plot shows the marginal Gini shift against the marginal Shannon shift for the mushroom characteristics, in the same format as the plot above.  Here, only four points are flagged as outliers, corresponding to the four binary variables identified above from the normalized shift plot: Bruises (the point in the extreme upper right), GillSpace (the point just barely in the upper right quadrant), and GillSize and StalkShape (the two points in the extreme lower left).  However, if we lower the Hampel identifier threshold from t = 2 to t = 1.5, we again identify CapSurf and Pop as potentially influential variables.



This last observation suggests an alternative interpretation approach that may be worth exploring.  Specifically, both of the two previous plots give clear visual evidence of “cluster structure,” and the Hampel identifier does extract some or all of this structure from the plot, but only if we apply a sufficiently judicious tuning to the threshold parameter.  A possible alternative would be to apply cluster analysis procedures, and this will be the subject of one or more subsequent posts.  In particular, there are many different clustering algorithms that could be applied to this problem, and the results are likely to be quite different.  The key practical question is which ones – if any – lead to useful ways of grouping these mushroom characteristics.  Subsequent posts will examine this question further from several different perspectives.

To leave a comment for the author, please follow the link and comment on his blog: ExploringDataBlog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Map of divorce in Mexico

at by Diego Valle-Jones on R-Bloggers

(This article was first published on Diego Valle's Blog, and kindly contributed to R-bloggers)

Keeping with this week's divorce theme, here's a map of the Mexican states where marriages are most likely to end in divorce. Perhaps not surprisingly, there seems to be an inverse correlation with the state percentage of the population that is catholic and the proportion of marriages which end in divorce. You can download the code to generate the map from my GitHub account.


To leave a comment for the author, please follow the link and comment on his blog: Diego Valle's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Average Annual Population Growth Rate of Tawi-Tawi

at by alstated on R-Bloggers

(This article was first published on ALSTAT R Blog, and kindly contributed to R-bloggers)

























R Codes


To leave a comment for the author, please follow the link and comment on his blog: ALSTAT R Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

RcmdrPlugin.KMggplot2 _0.1-0 is on CRAN now

at by triadsou on R-Bloggers

(This article was first published on Triad sou., and kindly contributed to R-bloggers)

I posted a new version of the "RcmdrPlugin.KMggplot2" package, which is an Rcmdr plug-in for a "ggplot2" GUI front-end.

This package assists you to make "ggplot2" graphics.


NEWS

Changes in version 0.1-0 (2012-05-18)

theme_simple()

The theme_simple() is a simple black and white theme.

Top and right borders and both grids are removed from the theme_bw().

f:id:triadsou:20120518145842p:image


A list box for facet_xx() functions

A list box for facet_xx() functions was added.

f:id:triadsou:20120518145844p:image

f:id:triadsou:20120518145843p:image


Confidence intervals for Kaplan-Meier plots

f:id:triadsou:20120518152149p:image


Violin plots

f:id:triadsou:20120518145845p:image


Stacked bar charts

f:id:triadsou:20120518145847p:image


Univariate plots at diagonal positions (ggplot2::plotmatrix)

f:id:triadsou:20120518145846p:image


Related Posts

To leave a comment for the author, please follow the link and comment on his blog: Triad sou..

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

My experiences with Rcpp

at by mathiasbader on R-Bloggers

(This article was first published on Clustering epigenetic data using a Dirichlet process prior » R, and kindly contributed to R-bloggers)

The last seven days till Tuesday I have been working on the conversion of the code of my master thesis from scripted R (statistics) to compiled C++ using the Rcpp package from Dirk Eddelbuettel. Despite the initial effort necessary to set up the system (especially under Windows), I was looking forward to a huge speed-up of my simulation.

Setting up Rcpp

Setting up Rcpp under Windows is more or less straight-forward – there are just many small things you should take care of and it took me some time to figure out all of them. A good starting point is the Rcpp-FAQ that gives information on which software you’ll need. Luckily Duncan Murdoch provides the RTools package which puts together nearly everything you need. Due to license or size limitations, there are some tools still missing which you will need to install additionally. They are described in Appendix D – The Windows toolset of the R Installation and Administration manual. Be careful not to have spaces in the paths of the components. I especially want to emphasize that the German version of Windows 7 shows the “Program Files” folder as “Programme” which looks as though it doesn’t have a space. If you click on the address bar of the explorer though, you will see that “Programme” is just a link to the “Program Files” folder which actually has a space and therefore installing R there will not work (or at least didn’t work for me).

Converting R-code to C++

Converting my code from R to C++ was easier than I first thought. Using the inline method from Rcpp you can directly include C++ code as a string in R, have it compiled into a function and call it from R. The compiler error messages will be forwarded to R and displayed there which helps a lot debugging your code. Just some of the error messages are apparently not forwarded, for example if you try to access an std::vector-element with an index out of range, R will simply crash without any warning. Converting my simulation, this was the only error I found which R was not able to communicate with me.

Using Visual Studio with Rcpp

Even though Dirk Eddelbuettel and Romain François answer the question “Can I use Rcpp with Visual Studio” straightforwardly with “Not a chance”, I was using Visual Studio quite extensively for my development. It is true, that you won’t be able to compile your code with Rcpp, that is what you still need the toolchain from RTools for. But that doesn’t keep you from using Visual Studio for development. My solution looks as follows: I use a file dppClustering.cpp which I load and compile from R with the include function. In this file all variables are converted from Rcpp-variables into C++ variables. With these I then call my simulation-class that does contains the logic.

To develop with VS, instead of using dppClustering.cpp I created a new project that includes the simulation classes and accesses their functionality. With this set-up I am able to use the complete power of Visual Studio for my development, but I can still compile from within R using Rcpp.

How about the speed-up?

The runtime difference between R and C++ code is just mind-blowing. I averaged the runtime of 40 C++ runs and 2 R runs and calculated a speed-up of over 100.

The combination of fast implementation with R and additional runtime improvements using C++ with Rcpp for the computationally intensive parts of the code makes Rcpp an enormously powerful tool – the week I invested really payed off.


To leave a comment for the author, please follow the link and comment on his blog: Clustering epigenetic data using a Dirichlet process prior » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

R is to SAS as Java is to COBOL

at by David Smith on Revolutions

An interview with Revolution Analytics CEO Dave Rich was published this week by BeyeNetwork. During the interview, Dace was asked about how the statistical modeling platforms have changed over the decades:

People have been doing statistical modeling and predictive analytics for 50 years now, SAS and SPSS have been around since the early ‘70s. What’s different now -- what’s making this move toward other statistical and “big data” areas?

David Rich: Well, I think obviously SAS and SPSS have been around, as you pointed out, for decades. We call that sort of the first generation of analytics and insight-driven solutions. In my perspective, having been in the business for more than three decades, it reminds me a bit of what COBOL was back in the day relative to business software. I see R as the more modern language. In this analogy, R would represent Java or C++. What happened in the middle of the nineties when the shift occurred is very similar to where we are now with R. Open source is a worldwide collaboration innovation. It’s a way to tap into that channel for research, and I think the role that Revolution Analytics can play – very similar to what Red Hat did back in the Linux days – is to be the conduit between the community and enterprise deployment.

The conversation also touched on the future of big data analytics, impact of advanced analytics on business, and the benefits of R and Revolution R Enteprise to reduce costs and expand the scope of possibility with big data analytics. For the complete interview, follow the link below.

BeyeNetwork: Advanced Analytics, Big Data and the Power of R: A Q&A Spotlight with David Rich, CEO of Revolution Analytics

R is to SAS as Java is to COBOL

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

An interview with Revolution Analytics CEO Dave Rich was published this week by BeyeNetwork. During the interview, Dace was asked about how the statistical modeling platforms have changed over the decades:

People have been doing statistical modeling and predictive analytics for 50 years now, SAS and SPSS have been around since the early ‘70s. What’s different now -- what’s making this move toward other statistical and “big data” areas?

David Rich: Well, I think obviously SAS and SPSS have been around, as you pointed out, for decades. We call that sort of the first generation of analytics and insight-driven solutions. In my perspective, having been in the business for more than three decades, it reminds me a bit of what COBOL was back in the day relative to business software. I see R as the more modern language. In this analogy, R would represent Java or C++. What happened in the middle of the nineties when the shift occurred is very similar to where we are now with R. Open source is a worldwide collaboration innovation. It’s a way to tap into that channel for research, and I think the role that Revolution Analytics can play – very similar to what Red Hat did back in the Linux days – is to be the conduit between the community and enterprise deployment.

The conversation also touched on the future of big data analytics, impact of advanced analytics on business, and the benefits of R and Revolution R Enteprise to reduce costs and expand the scope of possibility with big data analytics. For the complete interview, follow the link below.

BeyeNetwork: Advanced Analytics, Big Data and the Power of R: A Q&A Spotlight with David Rich, CEO of Revolution Analytics

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

In Mexico, more marriages ending in divorce, and sooner

at by David Smith on Revolutions

R user Diego Valle analyzed the rate of divorces in Mexican marriage since 1993 (the earliest date for which data are available) and found that not only have more marriages ended in divorce over time, but marriages that do end are ending sooner:

Mexico-divorces-by-length

This chart is a bit complicated, but it bears close inspection. Each line you see is a cohort of all of the marriages in a given year: 1993, 1994, all the way up to 2009. The vertical height of each line is proportional to the total number of divorces in each subsequent year within each cohort (expressed as a fraction of all marriages in the cohort year). Cleverly, the cohort lines are all arranged not by calendar time, but by years since marriage: the leftmost point represents divorces in the first year (relatively few), then divorces in the second year, and so on. 

More residents of Mexico married in 1993 saw their 10th wedding anniversary than those married in 1998. Overall, the trend is clear: more weddings that take place now will end than those from previous years, and they're likely to end sooner as well. Although there's not much historical data for recent marriage, the steady progression of divorce rates over time allows Diego to create a forecast (using a linear mixed-effects model in the R language) of the outcomes of recent marriages. He predicts, for example, that 11% of marriages registered in 2007 will have ended in divorce by 2022. By contrast though, that's about the same rate as US marriages from the fifties.

If you want to do a similar analysis, Diego provides R code in his post linked below, and at his github.

Diego Valle-Jones: Proportion of marriages ending in divorce

In Mexico, more marriages ending in divorce, and sooner

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R user Diego Valle analyzed the rate of divorces in Mexican marriage since 1993 (the earliest date for which data are available) and found that not only have more marriages ended in divorce over time, but marriages that do end are ending sooner:

Mexico-divorces-by-length

This chart is a bit complicated, but it bears close inspection. Each line you see is a cohort of all of the marriages in a given year: 1993, 1994, all the way up to 2009. The vertical height of each line is proportional to the total number of divorces in each subsequent year within each cohort (expressed as a fraction of all marriages in the cohort year). Cleverly, the cohort lines are all arranged not by calendar time, but by years since marriage: the leftmost point represents divorces in the first year (relatively few), then divorces in the second year, and so on. 

More residents of Mexico married in 1993 saw their 10th wedding anniversary than those married in 1998. Overall, the trend is clear: more weddings that take place now will end than those from previous years, and they're likely to end sooner as well. Although there's not much historical data for recent marriage, the steady progression of divorce rates over time allows Diego to create a forecast (using a linear mixed-effects model in the R language) of the outcomes of recent marriages. He predicts, for example, that 11% of marriages registered in 2007 will have ended in divorce by 2022. By contrast though, that's about the same rate as US marriages from the fifties.

If you want to do a similar analysis, Diego provides R code in his post linked below, and at his github.

Diego Valle-Jones: Proportion of marriages ending in divorce

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Should I be nice?

at by Karl Broman on R-Bloggers

(This article was first published on The stupidest thing... » R, and kindly contributed to R-bloggers)

I got the following email.

Subject: i have a question?
Date: May 18, 2012 7:57:56 AM CDT

how can i enter the data of QTL analysis.

That was the whole thing.

I presume that the writer wishes to use my R/qtl software.

I could probably respond helpfully (for example, “See the sample data files and code at the R/qtl web site.”), but can’t I expect more people from seeking my help?

I suppose there are three options:

  • Just answer the inferred question.
  • Answer the inferred question, but in an offensive way.
  • Ask the writer to provide further details.

I chose the third option, but I probably should have chosen the first. A fabulously useful person who I greatly admire is well known for his use of the middle option.


Update:

I responded

Your question is not answerable without further details.

to which the correspondent replied

How can i analysis data the by RQTL?

I responded with links to tutorials, sample data files, and my book with Śaunak Sen.


To leave a comment for the author, please follow the link and comment on his blog: The stupidest thing... » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

A minimal network example in R

at by sieste on R-Bloggers

(This article was first published on sieste » R, and kindly contributed to R-bloggers)

Network science is potentially useful for certain problems in data analysis, and I know close to nothing about it.

In this short post I present my first attempt at network analysis: A minimal example to construct and visualize an artificial undirected network with community structure in R. No network libraries are loaded. Only basic R-functions are used.

The final product of this post is this plot:

I am going to walk you through the code to produce the above plots. The complete code is given at the bottom of the post.

The starting point

First of all the network specification:

#network specs: K communities, N nodes, n[1] of them have 
#label 1, etc..., p(link inside)=p.in, p(link outside)=p.out
K     <- 3
N     <- 40 
n     <- c(rep(floor(N/K),K-1),floor(N/K)+N%%K)
labls <- rep(seq(K),n)
p.in  <- .9
p.out <- .1

pairs	   <- expand.grid(seq(N),seq(N))
uniq.pairs <- pairs[which(pairs[,1]<pairs[,2]),]
uniq.pairs <-lapply(apply(uniq.pairs,1,list),unlist) # I hate R for this line

There are N nodes and K communities. n[1] nodes are in community 1, n[2] in community 2, etc.

The probability p.in is the link probability between two nodes that are in the same community and p.out is the link probability across communities.

The variable labls contains the label of each node.

The variable pairs contains all possible tuples of the set (1,…,N). uniq.pairs is a list of 2-element vectors, containing only the unique links between nodes.

The p-matrix

Such a network with community structure is typically modeled by what I would call a p-matrix (the technical term is “stochastic block matrix”). The entry in the i-th row and j-th column tells me the probability that node i is connected to node j. Since the network is undirected, a link between i and j is equivalent to a link between j and i, hence the p-matrix is symmetric. Furthermore no loops are allowed here, so the diagonal elements of the p-matrix are zero.

Community structure leads to the p-matrix being block-diagonal with large constant entries p.in in the blocks across the diagonal and small or vanishing entries p.out outside of these blocks.

#plot the block matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA, main="p-matrix")
lapply( uniq.pairs, function(z) {
    colr <- ifelse(labls[z[1]]==labls[z[2]],gray(1-p.in), gray(1-p.out))
    points(z,N-z[c(2,1)],pch=15,col=colr) } )

The adjacency list

Based on the network specifications, an adjacency list is constructed. The adjacency list is the list of links in the network. A link between nodes i and j is realized with probability p.in if the nodes are in the same community, and with probability p.out otherwise.

#sample an adjacency list
adj.list   <- lapply(uniq.pairs,function(z) {
		p <- ifelse( labls[z[1]]==labls[z[2]],p.in,p.out)
		ifelse(runif(1)<p,1,0)*z})
adj.list   <- adj.list[ which(lapply(adj.list,sum)>0) ]

One way of visualizing the network is to plot the adjacency matrix which has dimension N*N and has a one at index (i,j) if nodes i and j are connected and a zero otherwise:

#plot the adjacency matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA, main="adjacency matrix")
lapply( adj.list, function(z) {
    points(z,N-z[c(2,1)],pch=15) } )

Visualizing the links

Another way of visualizing the network is to plot the individual nodes and connect them with lines according to their links:

#plot the network
plot(NULL,xlim=c(0,K),ylim=c(0,1),axes=F,xlab=NA,ylab=NA, main="network")
coords <- matrix(runif(2*N),ncol=2)+cbind(labls-1,rep(0,N))
lapply( adj.list, function(z) {
         x <- coords[ z,1 ]
         y <- coords[ z,2 ]
         lines(x,y,col="#55555544") } )
colrs <- rainbow(K)
points( coords, pch=15, col=colrs[ labls ])

So far so good

This is it. As I said, it was the first step.

Next I want to look at ways to guess the node labels (the community structure) only based on the network structure, so stay tuned.

Thanks for watching.

———————-

The complete code:

######################################################
# construct and plot a stochastic block model, and a 
# corresponding network with community structure
######################################################

################################################################
#network specs: K communities, N nodes, n[1] of them have 
#label 1, etc..., p(link inside)=p.in, p(link outside)=p.out
################################################################
N     <- 40
K     <- 3
n     <- c(rep(floor(N/K),K-1),floor(N/K)+N%%K)
labls <- rep(seq(K),n)
p.in  <- .9
p.out <- .1

##############################################################
#sample an adjacency list
# pairs (uniq.pairs)  ... list of possible (unique) node pairs
# adj.list	      ... list of links between nodes
##############################################################
pairs	   <- expand.grid(seq(N),seq(N))
uniq.pairs <- pairs[which(pairs[,1]<pairs[,2]),]
uniq.pairs <-lapply(apply(uniq.pairs,1,list),unlist) # I hate R for this line
adj.list   <- lapply(uniq.pairs,function(z) {
		p <- ifelse( labls[z[1]]==labls[z[2]], p.in,p.out)
		ifelse(runif(1)<p,1,0)*z})
adj.list   <- adj.list[ which(lapply(adj.list,sum)>0) ]

#everything in one plot
jpeg("network.jpg",width=500,height=500,quality=100)
par(mar=c(2,2,4,2),cex.main=1.5)
layout(t(matrix(c(1,2,3,3),nrow=2)))

#plot the block matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA, main="p-matrix")
lapply( uniq.pairs, function(z) {
    colr <- ifelse(labls[z[1]]==labls[z[2]],gray(1-p.in), gray(1-p.out))
    points(z,N-z[c(2,1)],pch=15,col=colr) } )

#plot the adjacency matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA, main="adjacency matrix")
lapply( adj.list, function(z) {
    points(z,N-z[c(2,1)],pch=15) } )

#plot the network
plot(NULL,xlim=c(0,K),ylim=c(0,1),axes=F,xlab=NA,ylab=NA, main="network")
coords <- matrix(runif(2*N),ncol=2)+cbind(labls-1, rep(0,N))
lapply( adj.list, function(z) {
         x <- coords[ z,1 ]
         y <- coords[ z,2 ]
         lines(x,y,col="#55555544") } )
colrs <- rainbow(K)
points( coords, pch=15, col=colrs[ labls ])
dev.off()


To leave a comment for the author, please follow the link and comment on his blog: sieste » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Example Reproducile Report using R Markdown: Analysis of California Schools Test Data

at by Jeromy Anglim on R-Bloggers

(This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers)

This is a quick set of analyses of the California Test Score dataset. The post was produced using R Markdown in RStudio 0.96. The main purpose of this post is to provide a case study of using R Markdown to prepare a quick reproducible report. It provides examples of using plots, output, in-line R code, and markdown. The post is designed to be read along side the R Markdown source code, which is available as a gist on github.

opts_knit$set(upload.fun = imgur_upload)

Preliminaries

  • This post builds on my earlier post which provided a guide for [Getting Started with R Markdown, knitr, and RStudio 0.96](jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html)
  • The dataset analysed comes from the AER package which is an accompaniment to the book Applied Econometrics with R written by Christian Kleiber and Achim Zeileis.

Load packages and data

# if necessary uncomment and install packages.  install.packages('AER')
# install.packages('psych') install.packages('Hmisc')
# install.packages('ggplot2') install.packages('relaimpo')
library(AER) # interesting datasets
library(psych) # describe and psych.panels
library(Hmisc) # describe
library(ggplot2) # plots: ggplot and qplot
library(relaimpo) # relative importance in regression
# load the California Schools Dataset and give the dataset a shorter name
data(CASchools)
cas <- CASchools

# Convert grade to numeric

# table(cas$grades)
cas$gradesN - cas$grades == "KK-08"

# Get the set of numeric variables
v <- setdiff(names(cas), c("district", "school", "county", "grades"))

Q 1 What does the CASchools dataset involve?

Quoting the help (i.e., ?CASchools), the data is “from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999” and the variables are:

* district: character. District code.
* school: character. School name.
* county: factor indicating county.
* grades: factor indicating grade span of district.
* students: Total enrollment.
* teachers: Number of teachers.
* calworks: Percent qualifying for CalWorks (income assistance).
* lunch: Percent qualifying for reduced-price lunch.
* computer: Number of computers.
* expenditure: Expenditure per student.
* income: District average income (in USD 1,000).
* english: Percent of English learners.
* read: Average reading score.
* math: Average math score.

Let's look at the basic structure of the data frame. i.e., the number of observations and the types of values:

str(cas)
## 'data.frame':    420 obs. of  15 variables:
## $ district : chr "75119" "61499" "61549" "61457" ...
## $ school : chr "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ grades : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ students : num 195 240 1550 243 1335 ...
## $ teachers : num 10.9 11.1 82.9 14 71.5 ...
## $ calworks : num 0.51 15.42 55.03 36.48 33.11 ...
## $ lunch : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer : num 67 101 169 85 171 25 28 66 35 0 ...
## $ expenditure: num 6385 5099 5502 7102 5236 ...
## $ income : num 22.69 9.82 8.98 8.98 9.08 ...
## $ english : num 0 4.58 30 0 13.86 ...
## $ read : num 692 660 636 652 642 ...
## $ math : num 690 662 651 644 640 ...
## $ gradesN : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
# Hmisc::describe(cas) # For more extensive summary statistics

Q. 2 To what extent does expenditure per student vary?

qplot(expenditure, data = cas) + xlim(0, 8000) + xlab("Money spent per student ($)") + 
ylab("Count of schools")

plot of chunk cas2


round(t(psych::describe(cas$expenditure)), 1)
##            [,1]
## var 1.0
## n 420.0
## mean 5312.4
## sd 633.9
## median 5214.5
## trimmed 5252.9
## mad 487.2
## min 3926.1
## max 7711.5
## range 3785.4
## skew 1.1
## kurtosis 1.9
## se 30.9

The greatest expenditure per student is around double that of the least expenditure per student.

Q. 3a What predicts expenditure per student?

# Compute and format set of correlations
corExp <- cor(cas["expenditure"], cas[setdiff(v, "expenditure")])
corExp <- round(t(corExp), 2)
corExp[order(corExp[, 1], decreasing = TRUE), , drop = FALSE]
##          expenditure
## income 0.31
## read 0.22
## math 0.15
## calworks 0.07
## lunch -0.06
## computer -0.07
## english -0.07
## teachers -0.10
## students -0.11
## gradesN -0.17

More is spent per student in schools :

  1. where people with greater incomes live
  2. reading scores are higher
  3. that are K-6

Q. 4 what is the relationship between district level maths and reading scores?

ggplot(cas, aes(read, math)) + geom_point() + geom_smooth()

plot of chunk cas4

At the district level, the correlation is very strong (r = The correlation is 0.92). From prior experience I'd expect correlations at the individual-level in the .3 to .6 range. Thus, these results are consistent with group-level relationships being much larger than individual-level relationships.

Q. 5 What is the relationship between maths and reading after partialling out other effects?

# command has strange syntax requiring column numbers rather than variable
# names
partial.r(cas[v], c(which(names(cas[v]) == "read"), which(names(cas[v]) ==
"math")), which(!names(cas[v]) %in% c("read", "math")))
## partial correlations 
## read math
## read 1.00 0.72
## math 0.72 1.00

The partial correlation is still very strong but is substantially reduced.

Q. 6 What fraction of a computer does each student have?

cas$compstud - cas$computer/cas$students
describe(cas$compstud)
## cas$compstud 
## n missing unique Mean .05 .10 .25 .50 .75
## 420 0 412 0.1359 0.05471 0.06654 0.09377 0.12546 0.16447
## .90 .95
## 0.22494 0.24906
##
## lowest : 0.00000 0.01455 0.02266 0.02548 0.04167
## highest: 0.32770 0.34359 0.34979 0.35897 0.42083
qplot(compstud, data = cas)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-4

The mean number of computers per student is 0.136.

Q. 7 What is a good model of the combined effect of other variables on academic performance (i.e., math and read)?

# Examine correlations between variables
psych::pairs.panels(cas[v])

plot of chunk cas7

pairs.panels shows correlations in the upper triangle, scatterplots in the lower triangle, and variable names and distributions on the main diagonal.
After examining the plot several ideas emerge.

# (a) students is a count and could be log transformed
cas$studentsLog - log(cas$students)

# (b) teachers is not the variable of interest:
# it is the number of students per teacher
cas$studteach - cas$students /cas$teachers
# (c) computers is not the variable of interest:
# it is the ratio of computers to students
# table(cas$computer==0)
# Note some schools have no computers so ratio would be problematic.
# Take percentage of a computer instead
cas$compstud - cas$computer / cas$students

# (d) math and reading are correlated highly, reduce to one variable
cas$performance <- as.numeric(
scale(scale(cas$read) + scale(cas$math)))

Normally, I'd add all these transformations to an initial data transformation file that I call in the first block, but for the sake of the narrative, I'll leave them here.

Let's examine correlations between predictors and outcome.

m1cor <- cor(cas$performance, cas[c("studentsLog", "studteach", "calworks", 
"lunch", "compstud", "income", "expenditure", "gradesN")])
t(round(m1cor, 2))
##              [,1]
## studentsLog -0.12
## studteach -0.23
## calworks -0.63
## lunch -0.87
## compstud 0.27
## income 0.71
## expenditure 0.19
## gradesN -0.16

Let's examine the multiple regression.

m1 <- lm(performance ~ studentsLog + studteach + calworks + lunch + 
compstud + income + expenditure + grades, data = cas)
summary(m1)
## 
## Call:
## lm(formula = performance ~ studentsLog + studteach + calworks +
## lunch + compstud + income + expenditure + grades, data = cas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8107 -0.2963 -0.0118 0.2712 1.5662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.99e-01 4.98e-01 1.80 0.072 .
## studentsLog -3.83e-02 1.91e-02 -2.01 0.045 *
## studteach -1.11e-02 1.59e-02 -0.70 0.487
## calworks 1.96e-03 2.96e-03 0.66 0.508
## lunch -2.65e-02 1.48e-03 -17.97 < 2e-16 ***
## compstud 7.88e-01 3.86e-01 2.04 0.042 *
## income 2.82e-02 4.89e-03 5.77 1.6e-08 ***
## expenditure 5.87e-05 4.90e-05 1.20 0.232
## gradesKK-08 -1.21e-01 6.49e-02 -1.87 0.062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.457 on 411 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.791
## F-statistic: 199 on 8 and 411 DF, p-value: <2e-16
##

And some indicators of predictor relative importance.

# calc.relimp from relaimpo package.
(m1relaimpo <- calc.relimp(m1, type = "lmg", rela = TRUE))
## Response variable: performance 
## Total response variance: 1
## Analysis based on 420 observations
##
## 8 Regressors:
## studentsLog studteach calworks lunch compstud income expenditure grades
## Proportion of variance explained by model: 79.48%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg
## studentsLog 0.009973
## studteach 0.016695
## calworks 0.177666
## lunch 0.492866
## compstud 0.025815
## income 0.251769
## expenditure 0.014785
## grades 0.010432
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs 5Xs
## studentsLog -0.08771 -0.0650133 -0.0558756 -0.0519312 -4.926e-02
## studteach -0.11918 -0.0861199 -0.0629499 -0.0462155 -3.372e-02
## calworks -0.05473 -0.0427576 -0.0324658 -0.0233760 -1.535e-02
## lunch -0.03199 -0.0310310 -0.0301497 -0.0293300 -2.856e-02
## compstud 4.15870 3.0673338 2.2639604 1.6844348 1.287e+00
## income 0.09860 0.0850555 0.0726892 0.0614726 5.140e-02
## expenditure 0.00030 0.0001986 0.0001374 0.0001013 8.061e-05
## grades -0.45677 -0.3345683 -0.2529014 -0.1981200 -1.628e-01
## 6Xs 7Xs 8Xs
## studentsLog -4.626e-02 -4.252e-02 -3.833e-02
## studteach -2.418e-02 -1.687e-02 -1.109e-02
## calworks -8.399e-03 -2.612e-03 1.962e-03
## lunch -2.785e-02 -2.718e-02 -2.654e-02
## compstud 1.034e+00 8.828e-01 7.884e-01
## income 4.250e-02 3.477e-02 2.821e-02
## expenditure 6.882e-05 6.206e-05 5.871e-05
## grades -1.414e-01 -1.291e-01 -1.215e-01

Thus, we can conclude that:

  1. Income and indicators of income (e.g., low levels of lunch vouchers) are the two main predictors. Thus, schools with greater average income tend to have better student performance.
  2. Schools with more computers per student have better student performance.
  3. Schools with fewer students per teacher have better student performance.

For more information about relative importance and the relaimpo package measures check out Ulrike Grömping's website.
Of course this is all observational data with the usual caveats regarding causal interpretation.

Now, let's look at some weird stuff.

Q. 8.1 What are common words in Californian School names?

# create a vector of the words that occur in school names
lw <- unlist(strsplit(cas$school, split = " "))

# create a table of the frequency of school names
tlw <- table(lw)

# extract cells of table with count greater than 3
tlw2 <- tlw[tlw > 3]

# sorted in decreasing order
tlw2 <- sort(tlw2, decreasing = TRUE)

# values as proporitions
tlw2p <- round(tlw2/nrow(cas), 3)

# show this in a bar graph
tlw2pdf <- data.frame(word = names(tlw2p), prop = as.numeric(tlw2p),
stringsAsFactors = FALSE)
ggplot(tlw2pdf, aes(word, prop)) + geom_bar() + coord_flip()

plot of chunk unnamed-chunk-8

# make it log counts
ggplot(tlw2pdf, aes(word, log(prop * nrow(cas)))) + geom_bar() +
coord_flip()

plot of chunk unnamed-chunk-9

The word “Elementary” appears in almost all school names (98.3%). The word “Union” appears in around half (43.3%).

Other common words pertain to:

  • Directions (e.g., South, West),
  • Features of the environment (e.g., Creek, Vista, View, Valley)
  • Spanish words (e.g., rio for river; san for saint)

Q. 8.2 Is the number of letters in the school's name related to academic performance?

cas$namelen - nchar(cas$school)
table(cas$namelen)
## 
## 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39
## 1 4 9 26 28 31 33 27 30 45 38 28 36 30 18 10 5 4 6 3 1 2 2 2 1
round(cor(cas$namelen, cas[, c("read", "math")]), 2)
##      read math
## [1,] 0.03 0

The answer appears to be “no”.

Q. 8.3 Is the number of words in the school name related to academic performance?

cas$nameWordCount - sapply(strsplit(cas$school, " "), length)
table(cas$nameWordCount)
## 
## 2 3 4 5
## 140 202 72 6
round(cor(cas$nameWordCount, cas[, c("read", "math")]), 2)
##      read math
## [1,] 0.05 0.01

The answer appears to be “no”.

Q. 8.4 Are schools with nice popular nature words in their name doing better academically?

tlw2p  #recall the list of popular names
## lw
## Elementary Union City Valley Joint View
## 0.983 0.433 0.060 0.040 0.031 0.019
## Pleasant San Creek Oak Santa Lake
## 0.017 0.017 0.014 0.014 0.014 0.012
## Mountain Park Rio Vista Grove Lakeside
## 0.012 0.012 0.012 0.012 0.010 0.010
## South Unified West
## 0.010 0.010 0.010
# Create a quick and dirty list of popular nature names
naturenames <- c("Valley", "View", "Creek", "Lake", "Mountain", "Park",
"Rio", "Vista", "Grove", "Lakeside")

# work out whether the word is in the school name
schsplit <- strsplit(cas$school, " ")
cas$hasNature <- sapply(schsplit, function(X) length(intersect(X,
naturenames)) > 0)
round(cor(cas$hasNature, cas[, c("read", "math")]), 2)
##      read math
## [1,] 0.09 0.08

So we've found a small correlation.

Let's graph the data to see what it means:

ggplot(cas, aes(hasNature, read)) + geom_boxplot() + geom_jitter(position = position_jitter(width = 0.1)) + 
xlab("Has a nature name") + ylab("Mean student reading score")

plot of chunk unnamed-chunk-14

So in the sample nature schools have slightly better reading score (and if we were to graph it, maths scores). However, the number of schools having nature names is actually somewhat small (n= 61) despite the overall quite large sample size.

But is it statistically significant?

t.read <- t.test(cas[cas$hasNature, "read"], cas[!cas$hasNature, 
"read"])
t.math <- t.test(cas[cas$hasNature, "math"], cas[!cas$hasNature,
"math"])

So, the p-value is less than .05 for reading (p = 0.046) but not quite for maths (p = 0.083). Bingo! After a little bit of data fishing we have found that reading scores are “significantly” greater for those schools with the listed nature names.

But wait: I've asked three separate exploratory questions or perhaps six if we take maths into account.

  • $\frac{.05}{3} =$ 0.0167
  • $\frac{.05}{6} =$ 0.0083

At these Bonferonni corrected p-values, the result is non-significant. Oh well…

Review

Anyway, the aim of this post was not to make profound statements about California schools. Rather the aim was to show how easy it is to produce quick reproducible reports with R Markdown. If you haven't already, you may want to open up the R Markdown file used to produce this post in RStudio, and compile the report yourself.

In particular, I can see R Markdown being my tool of choice for:

  • Blog posts
  • Posts to StackExchange sites
  • Materials for training workshops
  • Short consulting reports, and
  • Exploratory analyses as part of a larger project.

The real question is how far I can push Markdown before I start to miss the control of LaTeX. Markdown does permit arbitrary HTML. Anyway, if you have any thoughts about the scope of R Markdown, feel free to add a comment.

To leave a comment for the author, please follow the link and comment on his blog: Jeromy Anglim's Blog: Psychology and Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Proportion of marriages ending in divorce

at by Diego Valle-Jones on R-Bloggers

(This article was first published on Diego Valle's Blog, and kindly contributed to R-bloggers)

Over the last two decades families in Mexico have undergone rapid social changes. The proportion of marriages ending in divorce has risen for each cohort since data became available, this is independant of the recently approved express divorce law in the Federal District.

Looking at the the period during which marriages are more likely to end in divorce it seems there is a 3 to 7 year itch.
and by state we see that the rise in the proportion of marriages that end in divorce has been nearly universal across the states of Mexico, though, of course, there is some internal migration within Mexico and more recently people have been divorcing in the Federal District because of its more permissive divorce laws, so the numbers aren't exact.
Cumulative proportion by state:
While it will take more than a decade to obtain long term marriage dissolution data for recent cohorts, the rising pattern looks very predictable and I used a multi-level model to predict future marriage dissolutions. The data has been adjusted because divorces registered in 2009 were under-counted by about 5%, and those registered in 2008 by about 1%.

fit <- lme(fixed = percent.divorce ~ marriage.length + log(marriage.length) + marriage.year + express.divorce,
           random = ~ -1 | marriage.year, 
           data = subset(marriage.duration, marriage.length >= 2))

Cross country comparisons are always problematic, but comparing the percentage of all marriages that end in divorce in Mexico to first marriages that end in divorce in the US (not an apples to apples comparison!), the divorce rate would be lower in Mexico than in the US during the 1950s. See Figure 2 of Marriage and Divorce: Changes and their Driving Forces by Stevenson and Wolfers.

There are a couple of reasons why my assumptions in making the estimates may be wrong:
  • Thought I did add a dummy variable denoting if express divorce had been approved, if divorces keep increasing and increasing in the Federal District eventually they'll make a dent in the country wide statistics. And as I mentioned in the post there is some uncertainty on whether the rise will prove permanent.
  • As we saw in my last post express divorce has proven particularly popular outside the Federal District, so I wouldn't be surprised if in the future other states adopted similar laws.
And let's not forget that births outside of marriage are now a majority of births in Mexico, so looking at marriage dissolution provides only a partial look at what union dissolution is like



P.S. You can download the code from my GitHub account

To leave a comment for the author, please follow the link and comment on his blog: Diego Valle's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Visualizing the CRAN: Graphing Package Dependencies

at by wrathematics on R-Bloggers

(This article was first published on librestats » R, and kindly contributed to R-bloggers)

I had been meaning to start toying with the igraph package for a while. So a few weeks ago (lay off, I'm busy), I decided to grab a bunch of CRAN data about package dependencies. The easiest way that I could think to get this information was to just grab the html files for all the package descriptions and chop through them.

Quick note before I forget: I'm not looking at any base packages. Only packages in the CRAN in the pictures to follow.

Much of this information can be gathered from installed.packages(), assuming of course you have every package installed (and if you're on Windows, that's impossible). This also requires you to download gigabytes of data, rather than my method which yanks about 60mb of html.

I'll get to just how I scraped the data in a bit, but first, the pictures. If you are interested in the igraph package, you should really stop tooling around dumb boring blogs like this one and read the documentation, because it is a real gem. This is the gold standard for all aspiring R package developers out there.

Before we get going, here's the dataset in case you want to follow along. Quick warning, I don't recommend you try to open this up in a spreadsheet program. Uncompressed, the dataset is 30mb and is basically a 4000x4000 matrix (compressed it's just a few hundred kb). A quick note about the dataset; it is a matrix consisting of the numbers 0, 1, and 2.  A 0 means there is no relation between packages, a 1 means there is a dependence between packages, and a 2 means that one package suggests another.  More specifically (with dependence as an example), the entry X[m,n] = 1 means that the package at row m depends on the package at column n.

Ok, so pretty picture time.  Quick note, for the purpose of displaying these on a webpage, I can't use nice high quality pdf's.  So I have to use png's, which really blow up in file size if I attempt to have any kind of reasonable resolution.  But this should give you some idea of how things look, anyway.

So this is probably no surprise to you, but MASS is the most "depended on" package on the CRAN.  A total of 294 of the 3794 packages (almost 8%) in the CRAN depend on MASS.  An additional 95 suggest MASS; so just over 10% of all R packages either suggest or depend on MASS.

That's neat and all (actually it's crazy impressive), but I want a pretty graph.  Here's a link diagram of all the packages that depend on MASS


Which really is quite impressive. Here (and throughout), A-->B means B depends on A.  Another package that is near and dear to my heart is Dr. Hyndman's amazing forecast package.  Here's a linkage graph for all packages that depend on and suggest forecast

I even included CRAN taskview diagrams.  Here's the linkage diagram for the machine learning taskview

I made a little function that's mostly coherent (lay off, I'm busy) so you can visualize your own package. Or spy on someone else's. I'm not here to judge you, friend.

showlinkage <- function(df, package, suggests=FALSE, task=NULL, include.self=TRUE){
	require(igraph)

	tasks <- x[(ncol(df)-28):ncol(df)]
	df <- df[1:(ncol(df)-29)]

	if (!suggests){
		df[] <- lapply(df, function(x){replace(x, x == 2, 0)})
	} else {
		df[] <- lapply(df, function(x){replace(x, x == 2, 1)})
	}

	if (!is.null(task)){
		task <- paste("taskclass", task, sep="")
		rc <<- which(tasks[which(colnames(tasks)==task)]==1)
		df <- df[rc, rc]
	} 

	if (package != "ByTask"){
		df <- df1==1 | rownames(df)==package), which(df[which(rownames(df)==package), ]==1)), c(which(df[package]==1 | rownames(df)==package), which(df[which(rownames(df)==package), ]==1))]
		if (!include.self){
			df <- df[-which(rownames(df)==package), -which(colnames(df)==package)]
		}
	}

	df <- t(df) # reverse linkage arrows

	g <- graph.adjacency(df, mode="directed", weighted=TRUE, add.rownames=TRUE)

	plot(g, vertex.label.color="blue",
			edge.width=.5, edge.color="#ACC49D", edge.arrow.size=.5,
			vertex.label.cex=.7, vertex.label=row.names(df), vertex.size=0, vertex.color="white",
			#layout=layout.reingold.tilford
			layout=layout.kamada.kawai
	)
}

So for the first MASS graph, you would do

showlinkage(x, "MASS", include.self=TRUE, suggests=FALSE)

Setting include.self to FALSE here would drop MASS from the diagram, so you could see the relationship of all the packages that depend on MASS without having MASS sitting there in the diagram with all its...erm...mass.

For the taskview stuff, you would do

showlinkage(x, "ByTask", include.self=TRUE, suggests=FALSE, task="MachineLearning")

So what about all of the CRAN dependencies at once?  This is harder to do because there's just SO MUCH STUFF out there in the CRAN.  Now, as it turns out, just over 71% of all packages aren't dependencies for other packages.  So while there is a STAGGERING amount of package inbreeding out there, still the majority of packages are lonely little islands.

Given the volume of stuff we're working with here, we need to think about the layout algorithm.  That is, we would have to, except the handsome people on the igraph project already did the thinking for us, saving our brain cells for pop songs and beer.

Going through a few of the different options for layout algorithms, we can get different insights into these CRAN linkages.  First, using layout.fruchterman.reingold

Next with layout.circle (which for this data is kind of dumb, but you really get a sense for how much linking is going on)

We could also do layout.graphopt

And for my last example, layout.drl

Of course, if you have the dataset, you can tinker with these options some more and plot in your very own way.  For these graphs, assuming you store the csv in the object x, you might do something like:

y <- x[1:(ncol(x)-29)]
y[] <- lapply(y, function(x){replace(x, x == 2, 0)}) 

g <- graph.adjacency(y, mode="directed", weighted=TRUE, add.rownames=TRUE)

plot(g,
		edge.width=.5, edge.color="#ACC49D", edge.arrow.size=.5,
		vertex.label.cex=.7, vertex.label="", vertex.size=1, vertex.color="black",
		layout=layout.fruchterman.reingold
)

which would reproduce the first plot.

Well, in my usual fashion, here's how I grabbed up all the data. In the past I have used shell to do my data scraping because I can script these things out in about 5-10 minutes in shell. Things are no different this time (have I mentioned that I am busy?), so that's what you get.

#!/bin/sh

# -------------------------------------------------------
# setup/mirroring
# -------------------------------------------------------

outdir="wherever"
cd $outdir
touch out.csv

wget -e robots=off -r -l5 --no-parent -A.html http://cran.r-project.org/web/packages

mirloc="cran.r-project.org/web/packages"

# -------------------------------------------------------
# outfile setup
# -------------------------------------------------------

packages=`ls $mirloc`

for p in $packages;do
	echo -n "$p," >> out.csv
done

tasks=`cat $mirloc/../views/index.html | grep "</a>" | sed -e 's/<[^>]*>//g'`

echo -n "PackageName" >> out.csv

for t in $tasks;do
	echo -n "taskclass$t," >> out.csv
done

sed -i 's/.$//' out.csv

echo "" >> out.csv

# -------------------------------------------------------
# generating data for the adjacency matrix
# -------------------------------------------------------

for prow in $packages;do
	out=""

	depends=`grep -A 1 "<tr><td valign=top>Depends:</td>" -i $mirloc/$prow/index.html | tail -n 1 | awk -F "," '{ $1=""; print $0 }' | sed -e 's/<[^>]*>//g' -e 's/([^)]*)//g'`
	suggests=`grep -A 1 "<tr><td valign=top>Suggests:</td>" -i $mirloc/$prow/index.html | tail -n 1 | awk -F "," '{ $1=""; print $0 }' | sed -e 's/<[^>]*>//g' -e 's/([^)]*)//g'`

	echo -n $prow >> out.csv

	for pcol in $packages;do
		isin=0
		for d in $depends;do
			if [ $pcol = $d ];then
				isin=1;break
			fi
		done
		for s in $suggests;do
			if [ $pcol = $s ];then
				isin=2;break
			fi
		done
		out="$out,$isin"
	done

	taskview=`grep -r "${prow}</a>" $mirloc/../views/*.html | awk -F ":" '{print $1}' | uniq | sed -e 's/.html//' -e 's/.*\///g'`
	for t in $tasks;do
		isin=0
		for tv in $taskview;do
			if [ $tv = $t ];then
				isin=1;break
			fi
		done
		out="$out,$isin"
	done
	echo $out >> out.csv #| sed 's/,//' >> out.csv
done

To leave a comment for the author, please follow the link and comment on his blog: librestats » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Bar Graph Colours That Work Well

at by inkhorn82 on R-Bloggers

(This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers)

Ever since I started using ggplot2 more often at work in order to do graphs, I’ve realized something about the use of colour in bar graphs vs. dot plots: When I’m looking at a graph displayed on the brilliant Viewsonic monitor I’m using at work, the same relatively intense colours that work well in a dot plot start to bother me in a bar graph.  Take the bar graph immediately below for example.  The colour choice is not a bad one, but there’s something about the intensity of the colours that makes me want to find a new set of colours somewhat more soothing to my eyes.

The first resource I found was a “Color Encyclopedia” website called Color Hex and started looking for colours that seemed more soothing and could be used to compare 3 bars against one another in a bar graph.  You can search for colour names, colours according to their hexadecimal values, or even browse their list of “web safe colors”.  I stumbled upon the particular purple displayed in the graph below, and it simply gave me the other colours in the triad as suggestions.

Looking at this triad of colours, I’m actually quite pleased, but I still didn’t really understand why these colours worked, and how to select a new triad that didn’t bother me.  I shuffled through many different colours on the Color Hex website, and nothing else seemed to work with me as I wasn’t selecting colours based on any theory.

Then I stumbled upon an article by the good people at Perceptual Edge.  They seemed to confirm my earlier statement about the same intense colours working well when used to colour dot plots not working so well in bar plots.   Their solution is a simple one: choose from a list of colours of medium intensity.  On page 6 of the document linked above, they offer 8 different hues that look nice in a bar plot.  All I had to do to use these colours in the below plots was take a screenshot of the document, bring it into Inkscape, and hover the eye-dropper tool over the colours to get the hexadecimal colour values.  If you’re interested in using the values, I typed them out at the bottom of my post.  Now take a look at the graphs below:

The two graphs above follow the same principle that I had unknowingly touched upon when I chose the colours from the Color Hex website: stick with medium intensity, and your eyes won’t be jarred by the colour contrast.  I like that!

Anyway, below I show you the code I used to manually input the hexadecimal colour values into my ggplot bar graphs, and the list of 8 hexadecimal colour values corresponding with the colour boxes on page 6 of the Perceptual Edge document.  The variables a, c, and b were just variable names from a mock data frame that I cooked up for the purpose of the plots.

> colours = c(“#599ad3″, “#f9a65a”, “#9e66ab”)
> ggplot(e, aes(x=a, y=c, fill=b, stat=”identity”)) + geom_bar(position=”dodge”) + coord_flip()+scale_fill_manual(values=colours)

#727272
#f1595f
#79c36a
#599ad3
#f9a65a
#9e66ab
#cd7058
#d77fb3


To leave a comment for the author, please follow the link and comment on his blog: Data and Analysis with R, at Work.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Orbitz: R has become the data-mining tool of choice

at by David Smith on Revolutions

Sameer Chopra, vice president of Advanced Analytics at Orbitz Worldwide, wrote recently in Analytics magazine about the changing landscape of processes, software and systems for statistical modelers. In a section on "Big Data and Open Source Analytics", Chopra lays out the reasons why the R language "has become the data-mining tool of choice for machine learners":

  • R has very good integration with Hadoop, an area where established commercial statistical tools have frankly been playing catch-up over the past year. (Note: At the time of this writing, some established statistical solution providers were announcing an access interface to Hadoop.)
  • Many startups and smaller firms do not have deep pockets and are embracing open source tools such as the R programming language and NoSQL database systems such as MongoDB.
  • R is a leading language for developing new statistical methods, and it is a platform for statistical innovation and collaboration across both the corporate world and academia. In my opinion, for the first time in years, the stronghold of established commercial players seems to be potentially threatened; open source tools are better suited for Big Data and will slowly but surely continue to take share away from commercialized statistical packages. In fact, traditional statistical vendors have recognized that R is a force to be reckoned with. In response, many of these vendors have developed hooks into R so users can interface with the R language.
  • Based on the resumes I’ve been reading, the next generation of data miners is flocking to R as their go-to tool. Professors in general are comfortable with R; they tend to use R and Excel as part of their curriculum.
  • In short, open-source analytics tools and platforms have arrived.

Chopra says that the usage of R in the commercial sector is growing "as firms such as Revolution Analytics focus on the enterprise capabilities for R" (for example, Revolution R Enterprise's Hadoop support and enterprise deployment).

Chopra also has some interesting perspectives on statistical modeling vs machine learning which you can find in the full article linked below.

Analytics magazine: The times they are a changin’ for advanced analytics

Monitor: Removing zero values from the data set.

at by jrcuesta on R-Bloggers

(This article was first published on NIR-Quimiometría, and kindly contributed to R-bloggers)


I continue developing the Monitor function. This time a video from  "r twotorials”:
“how to access different records within a data frame by using logical tests in r”, gave me the idea to remove the zero values from a data set.
When somebody give you a table like the one I show, if the laboratory did not analyze a sample for any reason, I write a cero. So cero is not a reference value.
It will be diferent if we have for example: 0,001 or 0,00005, in this case we are managing very low concentrations, but we dont say just 0.
Anyway we can do a similar approach if we prefer "NA" indeed "0".
So, I prepare a new video with how the function performs right now:
Hope you like.

To leave a comment for the author, please follow the link and comment on his blog: NIR-Quimiometría.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Where's Waldo? Image Analysis in R

at by David Smith on Revolutions

R user Arthur Charpentier attempts to use the raster library and R functions to find Waldo in a "Where's Waldo" image:

Screen Shot 2012-05-17 at 1.47.51 PM

Sadly, it turned out that Waldo was a bit too tricky to spot using these techniques. But Arthur did have more success identifying the US flag in a shot from the Apollo mission, and identifying answers in the form for a multiple-choice test. All of the R code is provided at the link below, so that's a great place to start if you're looking to do some image analysis in R yourself.

Freakonometrics: Finding Waldo, a flag on the moon and multiple choice tests, with R

Where’s Waldo? Image Analysis in R

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R user Arthur Charpentier attempts to use the raster library and R functions to find Waldo in a "Where's Waldo" image:

Screen Shot 2012-05-17 at 1.47.51 PM

Sadly, it turned out that Waldo was a bit too tricky to spot using these techniques. But Arthur did have more success identifying the US flag in a shot from the Apollo mission, and identifying answers in the form for a multiple-choice test. All of the R code is provided at the link below, so that's a great place to start if you're looking to do some image analysis in R yourself.

Freakonometrics: Finding Waldo, a flag on the moon and multiple choice tests, with R

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Emulating local static variables in R

at by Bogumił Kamiński on R-Bloggers

(This article was first published on R snippets, and kindly contributed to R-bloggers)

Recently I was writing a code allowing to plot multiple ggplot2 plots on one page. I wanted to replicate standard behavior of  plot  function that plots graphs in sequence according to  mfrow/ mfcol option in par. The solution lead me to think of emulating C-like local static variables in R.
There are several solutions to this problem but I think that a nice one is by adding attributes to a function. Here is a simple example:

f <- function(x) {
    y <- attr(f, "sum")
    if (is.null(y)) {
        y <- 0
    }
    y <- x + y
    attr(f, "sum") <<- y
    return(y)
}

It can be applied as follows:

> for (in 1:5) cat(i, ": ", f(i)"\n", sep="")
1: 1
2: 3
3: 6
4: 10
5: 15

As it can be seen attribute "sum" is static but it can be thought of as local because it is not stored directly as a variable in global environment.

And here is the application of the concept to the problem of plotting several qplots in a sequence:

library(ggplot2)
library(grid)

# setup the ploting grid and plotting sequence
mplot.setup <- function(nrow, ncol, by.row = TRUE) {
    attributes(mplot.seq) <<- list(nrow = nrow, ncol = ncol,
      pos = 0, by.row = by.row)
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow, ncol)))
}  

# plot at given grid location
mplot <- function(graph, row, col) {
     print(graph, vp = viewport(layout.pos.row = row,
                                layout.pos.col = col))
}

# plot the at the next position in the sequence
mplot.seq <- function(graph) {
    pos <- attr(mplot.seq, "pos")
    nrow <- attr(mplot.seq, "nrow")
    ncol <- attr(mplot.seq, "ncol")

    if (attr(mplot.seq, "by.row")) {
        col <- 1 + (pos %% ncol)
        row <- 1 + ((pos %/% ncol) %% nrow)
    } else {
        row <- 1 + (pos %% nrow)
        col <- 1 + ((pos %/% nrow) %% ncol)
    }
    attr(mplot.seq, "pos") <<- pos + 1
    mplot(graph, row, col)
}

# application example
mplot.setup(2,4, FALSE)
for (i in 1:4) {
  mplot.seq(qplot(iris[,i], xlab = names(iris)[i]))
  mplot.seq(qplot(iris[,5], iris[,i], geom = "boxplot",
    xlab = "Species", ylab = names(iris)[i]) + coord_flip())
}

The following plot is produced by the above code:


To leave a comment for the author, please follow the link and comment on his blog: R snippets.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Orbitz: R has become the data-mining tool of choice

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Sameer Chopra, vice president of Advanced Analytics at Orbitz Worldwide, wrote recently in Analytics magazine about the changing landscape of processes, software and systems for statistical modelers. In a section on "Big Data and Open Source Analytics", Chopra lays out the reasons why the R language "has become the data-mining tool of choice for machine learners":

  • R has very good integration with Hadoop, an area where established commercial statistical tools have frankly been playing catch-up over the past year. (Note: At the time of this writing, some established statistical solution providers were announcing an access interface to Hadoop.)
  • Many startups and smaller firms do not have deep pockets and are embracing open source tools such as the R programming language and NoSQL database systems such as MangoDB.
  • R is a leading language for developing new statistical methods, and it is a platform for statistical innovation and collaboration across both the corporate world and academia. In my opinion, for the first time in years, the stronghold of established commercial players seems to be potentially threatened; open source tools are better suited for Big Data and will slowly but surely continue to take share away from commercialized statistical packages. In fact, traditional statistical vendors have recognized that R is a force to be reckoned with. In response, many of these vendors have developed hooks into R so users can interface with the R language.
  • Based on the resumes I’ve been reading, the next generation of data miners is flocking to R as their go-to tool. Professors in general are comfortable with R; they tend to use R and Excel as part of their curriculum.
  • In short, open-source analytics tools and platforms have arrived.

Chopra says that the usage of R in the commercial sector is growing "as firms such as Revolution Analytics focus on the enterprise capabilities for R" (for example, Revolution R Enterprise's Hadoop support and enterprise deployment).

Chopra also has some interesting perspectives on statistical modeling vs machine learning which you can find in the full article linked below.

Analytics magazine: The times they are a changin’ for advanced analytics

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Github Follower Graph with R

at by C on R-Bloggers

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

Graph a github user's followers (and follower's followers).



Each programming language tends to develop its own idiomatic set of data structures.  In R, data frames are often the structure of choice.  JSON (a subset of Javascript) has emerged as an open standard for data interchange that has largely displaced XML for many web based APIs.   This post gives an example of how to read JSON from GitHub, manipulate the data within R and produce a graph of the results (like the one above).

A few standard R libraries are required:
  • RCurl to retrieve JSON data (or anything else) from a URL
  • rjson to to parse the JSON data

In this particular API call (an example of json returned - no API key is required for the Github API), the JSON data will represent a GitHub user's (first thirty) followers.  This implies the use of a graph to represent the data so the igraph library will be used as well. 




With the RCurl and rjson libraries available, the json results can be retrieved and converted to an R list as follows:

o <- fromJSON(getURL('https://api.github.com/users/EzGraphs/followers'))





You can check the class for yourself using class(o) and view the length of the list using length(o) 
   
To convert the results to the data frame where rows represent followers and columns represent attributes, unlist the results, transpose the results and cast as a data frame:

df <- as.data.frame(t(sapply(o, unlist)))



That is the basic process - the rest of the code is the details related to getting the data into the iGraph object which can then be rendered using plot, tkplot (show above) or rglplot. 

To leave a comment for the author, please follow the link and comment on his blog: R-Chart.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

More Bixi Data Visualization

at by Corey Chivers on R-Bloggers

(This article was first published on bayesianbiologist » Rstats, and kindly contributed to R-bloggers)

I mentioned in a previous post that our team at the recent Hack/Reduce hackathon had some fun with a data set which consisted of Bixi station states at minute level temporal resolution. In addition to pulling out and plotting the flux at each station on an hourly basis, we also plotted the system state (number of bikes at each station) at each time-step we had. This totalled to 24,217 individual plots. Each plot was generated using an R script which took in the system state at each time-step, and output a png.

Team member Kamal Marhubi also did some nice post-processing to overlay the information on a map. The results are a little mesmerising. Things don’t get fun until about 40s into the video, as the first part mostly just shows the stations coming online for the first part of the season.

And for the non-Montrealers out there, here’s an image of a Bixi bike; our durable, data generating little hero.


To leave a comment for the author, please follow the link and comment on his blog: bayesianbiologist » Rstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Reproducible research with markdown, knitr and pandoc

at by kariert on R-Bloggers

(This article was first published on Ecological Modelling... » R, and kindly contributed to R-bloggers)

Over the last few weeks I was trying to optimise my workflow using markdow in combination with knitr and pandoc. Knitr is a grea new package by Yihui, expanding R’s capabilities for reproducible research.

I will illustrate my work flow with the following example, where I have a small R-script (script.r) that I want to embed into a report. However, I do not want to write LaTeX, nor I want/can specify my final output format in the beginning. That is where were pandoc comes in. Pandoc is the swiss-army knife if come to convert between markup languages.

The file script.r:


## @knitr gen-dat
a <- matrix(rnorm(100), nrow=10)

## @knitr plot
 image(a)

The report is written with  pandoc flavored markdown. The file (report_knit_.md) contains


% A sample report
% The author
% `r date()`

<!-- Setting up R -->
`ro warning=FALSE, dev="png", fig.cap="", cache=FALSE or`

<!-- read external r code -->
```{r reading, echo=FALSE}
read_chunk("script.r")
```

# The first part of my R script
Here I can generate my data
```{r}
<<gen-dat>>
```

# Results
An now the reults are plotted
```{r plot-fig, result="asis"}
<<plot>>
```

# More
Of course I can use inline elemtnts: 3 + 3 = `r 3+3`.

For each chunk there are plenty of options to modify it (see options).

To render my report, I need to first knit it in R and then use pandoc to convert it to the final format. This can be done with


Rscript -e "library(knitr); knit('report_knit_.md')"

This results in a pandoc flavored markdown document. Now I can use pandoc to convert this document into all by pandoc supported output format (list of formats):

  • A pdf file: pandoc -s report.md -t latex -o report.pdf
  • A html file: pandoc -s report.md -o report.html (with the -c flag html files can be added easily)
  • Openoffice: pandoc report.md -o report.odt
  • Word docx: pandoc report.md -o report.docx

To leave a comment for the author, please follow the link and comment on his blog: Ecological Modelling... » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

R’s increasing popularity. Should we care?

at by Luis on R-Bloggers

(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)

Some people will say ‘you have to learn R if you want to get a job doing statistics/data science’. I say bullshit, you have to learn statistics and learn to work in a variety of languages if you want to be any good, beyond getting a job today coding in R.

R4stats has a recent post discussing the increasing popularity of R against other statistical software, using citation counts in Google Scholar. It is a flawed methodology, at least as flawed as other methodologies used to measure language popularities. Nevertheless, I think is hard to argue against the general trend: R is becoming more popular. There is a deluge of books looking at R from every angle, thousands of packages and many jobs openings asking for R experience, which prompts the following question:

Should you/I/we care?

First answer: no. I try to use the best tool for the job; which often happens to be R but it can also be Python, SAS or Fortran. It is nice to be able to use the same tool, say R, across a range of problems, but there are occasions when it feels like using Excel for statistics: one can do it, but one knows that it isn’t a great idea. I know good statisticians that prefer R, SAS or Genstat; the tool doesn’t make you good in the same way that I could buy a Rickenbacker 4001 and I wouldn’t play like Geddy Lee.

Second answer: yes. Popularity attracts good people, who develop good packages, making new techniques available first in R. This doesn’t matter if you are into plain vanilla analyses (there is nothing wrong with this, by the way). Popularity + open source means that the system has been ported to a diversity of computer systems. Need R in a supercomputer? Done. R in a mac? Done. R for your strange operating system, for which there are C and Fortran compilers? Download it and compile it. Done. There is also the ‘I’m not crazy aspect’: other people take the software seriously.

Gratuitous picture: Firescapes II, night illuminated by bonfire (Photo: Luis).

I find people learning R because of ‘job opportunities’ irritating, in the same way that people learn javascript or java only to get a job. Give me any time people that learn R—or any other language for that matter—because they want to, because they are curious, because they want to share their discoveries with other people. Again, it is the difference between someone competent and someone great at what they do; great analysts are very much better than competent ones.

In the comments for the R4stats post there is a reference to R fanboys. Are R fanboys worse than fanboys of other statistical systems? In some respects the answer is yes, because many R users are also open source & open science supporters. Personally, I support both concepts, although I’m not dogmatic about them: I do buy some proprietary software and often can’t provide every detail about my work (commercially sensitive results). Maybe we are looking for a deeper change: we want to democratize statistics. We push for R not necessarily because it is intrinsically a better language, but because we can envision many people doing statistics to better understand the world around us and R is free. Anyway, I would prefer you call me a Python fanboy with split R personality.

To leave a comment for the author, please follow the link and comment on his blog: Quantum Forest » rblogs.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Exponential decay models

at by Pat on R-Bloggers

(This article was first published on Portfolio Probe » R language, and kindly contributed to R-bloggers)

All models are wrong, some models are more wrong than others.

The streetlight model

Exponential decay models are quite common.  But why?

One reason a model might be popular is that it contains a reasonable approximation to the mechanism that generates the data.  That is seriously unlikely in this case.

When it is dark and you’ve lost your keys, where do you look?  Under the streetlight.  You look there not because you think that’s the most likely spot for the keys to be; you look there because that is the only place you’ll find them if they are there.

Photo by takomabibelot via everystockphoto.com

Long ago and not so far away, I needed to compute the variance matrix of the returns of a few thousand stocks.  The machine that I had could hold three copies of the matrix but not much more.

An exponential decay model worked wonderfully in this situation.  The data I needed were:

  • the previous day’s variance matrix
  • the vector of yesterday’s returns

The operations were:

  • do an outer product with the vector of returns
  • do a weighted average of that outer product and the previous variance matrix

Compact and simple.

There are better ways, it’s just that the time was wrong.

The “right” way would be to use a long history of the returns of the stocks and fit a realistic model.  Even if that computer could have held the history of returns (probably not), it is unlikely it would have had room to work with it in order to come up with the answer.  Plus there would have been lots of data complications to work through with a more complex model.

“The shadows and light of models” used a different metaphor of light in relation to models.  If we can only use an exponential decay model, measuring ignorance is going to be a bit dodgy.  We won’t know how much of the ignorance is due to the model and how much is inherent.

Exponential smoothing

We can do exponential smoothing of the daily returns of the S&P 500 as an example.  Figure 1 shows the unsmoothed returns.  Figure 2 shows the exponential smooth with lambda equal to 0.97 — that is 97% weight on the previous smooth and 3% weight on the current point.  Figure 3 shows the exponential smooth with lambda equal to 1%.

Note: Often what is called lambda here is one minus lambda elsewhere.

Figure 1: Log returns of the S&P 500.

Figure 2: Exponential smooth of the log returns of the S&P 500 with lambda equal to 0.97.

Figure 3: Exponential smooth of the log returns of the S&P 500 with lambda equal to 0.99.

Notice that the start of the smooth in Figures 2 and 3 is a little strange.  The first point of the smooth is the actual first datapoint, which happened to be a little over 1%.  That problem can be solved by allowing a burn-in period — dropping the first 20, 50, 100 points in the smooth.

Chains of weights

The simple process of always doing a weighted sum of the previous smooth and the new data means that the smooth is actually a weighted sum of all the previous data with weights that decrease exponentially.  Figure 4 shows the weights that result from two choices of lambda.

Figure 4: Weights of lagged observations for lambda equal to 0.97 (blue) and 0.99 (gold). The weights drop off quite fast.  If the process were stable, we would want the observations to be equally weighted.  It is easy to do that compactly as well:

  • keep the sum of the statistic you want
  • note the number of observations
  • add the new statistic to the sum and divide by the number of observations

What we are likely to really want is some compromise between these extremes.

Half-life

The half-life of an exponential decay is often given.  This is the number of lags at which the weight falls to half of the weight for the current observation.  Figure 5 shows the half-lives for our two example lambdas.

Figure 5: Half-lives and weights of lagged observations for lambda equal to 0.97 (blue) and 0.99 (gold).

Generally the half-life is presented as if it is an intuitive value.  Well, sounds cool — I’m not sure it tells me much of anything.

Summary

When an exponential decay model is being used, you should ask:

Is there a good reason to use exponential decay, or is it only used because it has always been done like that?

Advances in hardware means that some of what could only be modeled with exponential decay in the past can now be modeled better.  It also means that what could not be done at all before can now be done with exponential decay.

Epilogue

He finds a convenient street light, steps out of the shade
Says something like, “You and me babe, how about it?”

from “Romeo and Juliet” by Mark Knopfler

Appendix R

R is, of course, a wonderful place to do modeling — exponential and other.

naive exponential smoothing

A naive implementation of exponential smoothing is:

> pp.naive.exponential.smooth
function (x, lambda=.97)
{
        ans <- x
        oneml <- 1 - lambda
        for(i in 2:length(x)) {
                ans[i] <- lambda * ans[i-1] + oneml * x[i]
        }
        ans
}

This is naive in at least two senses.

It does the looping explicitly.  For this simple case, there is an alternative.  However, in more complex situations it may be necessary to use a loop.

The function is also naive in assuming that the vector to be smoothed has at least two elements.

> pp.naive.exponential.smooth(100)
Error in ans[i] <- lambda * ans[i - 1] + oneml * x[i] :
  replacement has length zero

This sort of problem is a relative of Circle 8.1.60 in The R Inferno.

better exponential smoothing

Exponential smoothing can be done more efficiently by pushing the iteration down into a compiled language:

> pp.exponential.smooth
function (x, lambda=.97)
{
        xmod <- (1 - lambda) * x
        xmod[1] <- x[1]
        ans <- filter(xmod, lambda, method="recursive",
               sides=1)
        attributes(ans) <- attributes(x)
        ans
}

This still retains some naivety in that it doesn’t check that lambda is of length one.

See also the HoltWinters function in the stats package, and ets in the forecast package.

weight plots

A simplified version of part of Figure 5 is:

> plot(.03 * .97^(400:1), type="l", xaxt="n")
> axis(1, at=c(0, 100, 200, 300, 400),
+    labels=c(400, 300, 200, 100, 0))
> abline(v=400 - log(2) / .03)

time plot

The plots over time were produced with pp.timeplot.

Subscribe to the Portfolio Probe blog by Email

To leave a comment for the author, please follow the link and comment on his blog: Portfolio Probe » R language.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Getting Started with R Markdown, knitr, and Rstudio 0.96

at by Jeromy Anglim on R-Bloggers

(This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers)

This post examines the features of R Markdown using knitr in Rstudio 0.96. This combination of tools provides an exciting improvement in usability for reproducible analysis. Specifically, this post (1) discusses getting started with R Markdown and knitr in Rstudio 0.96; (2) provides a basic example of producing console output and plots using R Markdown; (3) highlights several code chunk options such as caching and controlling how input and output is displayed; (4) demonstrates use of standard Markdown notation as well as the extended features of formulas and tables; and (5) discusses the implications of R Markdown. This post was produced with R Markdown. The source code is available here as a gist. The post may be most useful if the source code and displayed post are viewed side by side. In some instances, I include a copy of the R Markdown in the displayed HTML, but most of the time I assume you are reading the source and post side by side.

Getting started

To work with R Markdown, if necessary:

  • Install R
  • Install the lastest version of RStudio (at time of posting, this is 0.96)
  • Install the latest version of the knitr package: install.packages("knitr")

To run the basic working example that produced this blog post:

Prepare for analyses

set.seed(1234)
library(ggplot2)
library(lattice)

Basic console output

To insert an R code chunk, you can type it manually or just press Chunks - Insert chunks or use the shortcut key. This will produce the following code chunk:

```{r}

```

Pressing tab when inside the braces will bring up code chunk options.

The following R code chunk labelled basicconsole is as follows:

```{r basicconsole}
x <- 1:10
y <- round(rnorm(10, x, 1), 2)
df <- data.frame(x, y)
df
```

The code chunk input and output is then displayed as follows:

x <- 1:10
y <- round(rnorm(10, x, 1), 2)
df <- data.frame(x, y)
df
##     x    y
## 1 1 1.31
## 2 2 2.31
## 3 3 3.36
## 4 4 3.27
## 5 5 5.04
## 6 6 6.11
## 7 7 8.43
## 8 8 8.98
## 9 9 8.38
## 10 10 9.27

Plots

Images generated by knitr are saved in a figures folder. However, they also appear to be represented in the HTML output using a data URI scheme. This means that you can paste the HTML into a blog post or discussion forum and you don't have to worry about finding a place to store the images; they're embedded in the HTML. UPDATE: These images may not display in RSS Readers (if you do not see images, go to the original post); I'm currently looking into the upload.fun option for storing images on imgur.)

Simple plot

Here is a basic plot using base graphics:

```{r simpleplot}
plot(x)
```
plot(x)

plot of chunk simpleplot

Note that unlike traditional Sweave, there is no need to write fig=TRUE.

Multiple plots

Also, unlike traditional Sweave, you can include multiple plots in one code chunk:

```{r multipleplots}
boxplot(1:10~rep(1:2,5))
plot(x, y)
```
boxplot(1:10 ~ rep(1:2, 5))

plot of chunk multipleplots

plot(x, y)

plot of chunk multipleplots

ggplot2 plot

Ggplot2 plots work well:

qplot(x, y, data = df)

plot of chunk ggplot2ex

lattice plot

As do lattice plots:

xyplot(y ~ x)

plot of chunk latticeex

Note that unlike traditional Sweave, there is no need to print lattice plots directly.

R Code chunk features

Create Markdown code from R

The following code hides the command input (i.e., echo=FALSE), and outputs the content directly as code (i.e., results=asis, which is similar to results=tex in Sweave).

```{r dotpointprint, results='asis', echo=FALSE}
cat("Here are some dot points\n\n")
cat(paste("* The value of y[", 1:3, "] is ", y[1:3], sep="", collapse="\n"))
```

Here are some dot points

  • The value of y[1] is 1.31
  • The value of y[2] is 2.31
  • The value of y[3] is 3.36

Create Markdown table code from R

```{r createtable, results='asis', echo=FALSE}
cat("x | y", "--- | ---", sep="\n")
cat(apply(df, 1, function(X) paste(X, collapse=" | ")), sep = "\n")
```
xy
11.31
22.31
33.36
43.27
55.04
66.11
78.43
88.98
98.38
109.27

Control output display

The folllowing code supresses display of R input commands (i.e., echo=FALSE) and removes any preceding text from console output (comment=""; the default is comment="##").

```{r echo=FALSE, comment="", echo=FALSE}
head(df)
```
  x    y
1 1 1.31
2 2 2.31
3 3 3.36
4 4 3.27
5 5 5.04
6 6 6.11

Control figure size

The following is an example of a smaller figure using fig.width and fig.height options.

```{r smallplot, fig.width=3, fig.height=3}
plot(x)
```
plot(x)

plot of chunk smallplot

Cache analysis

Caching analyses is straightforward. Here's example code. On the first run on my computer, this took about 10 seconds. On subsequent runs, this code was not run.

If you want to rerun cached code chunks, just delete the contents of the cache folder

```{r longanalysis, cache=TRUE}
for (i in 1:5000) {
lm((i+1)~i)
}
```

Basic markdown functionality

For those not familiar with standard Markdown, the following may be useful. See the source code for how to produce such points. However, RStudio does include a Markdown quick reference button that adequatly covers this material.

Dot Points

Simple dot points:

  • Point 1
  • Point 2
  • Point 3

and numeric dot points:

  1. Number 1
  2. Number 2
  3. Number 3

and nested dot points:

  • A
    • A.1
    • A.2
  • B
    • B.1
    • B.2

Equations

Equations are included by using LaTeX notation and including them either between single dollar signs (inline equations) or double dollar signs (displayed equations). If you hang around the Q&A site CrossValidated you'll be familiar with this idea.

There are inline equations such as $y_i = \alpha + \beta x_i + e_i$.

And displayed formulas:

$$\frac{1}{1+\exp(-x)}$$

knitr provides self-contained HTML code that calls a Mathjax script to display formulas. However, in order to include the script in my blog posts I took the script and incorporated it into my blogger template. If you are viewing this post through syndication or an RSS reader, this may not work. You may need to view this post on my website.

Tables

Tables can be included using the following notation

ABC
1MaleBlue
2FemalePink

Hyperlinks

  • If you like this post, you may wish to subscribe to my RSS feed.

Images

Here's an example image:

image from redmond barry building unimelb

Code

Here is Markdown R code chunk displayed as code:

```{r}
x <- 1:10
x
```

And then there's inline code such as x <- 1:10.

Quote

Let's quote some stuff:

To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune,

Conclusion

  • R Markdown is awesome.
    • The ratio of markup to content is excellent.
    • For exploratory analyses, blog posts, and the like R Markdown will be a powerful productivity booster.
    • For journal articles, LaTeX will presumably still be required.
  • The RStudio team have made the whole process very user friendly.
    • RStudio provides useful shortcut keys for compiling to HTML, and running code chunks. These shortcut keys are presented in a clear way.
    • The incorporated extensions to Markdown, particularly formula and table support, are particularly useful.
    • Jump-to-chunk feature facilitates navigation. It helps if your code chunks have informative names.
    • Code completion on R code chunk options is really helpful. See also chunk options documentation on the knitr website.
  • Other recent posts on R markdown include those by :

Questions

The following are a few questions I encountered along the way that might interest others.

Annoying <br/>'s

Question: I asked on the Rstudio discussion site: Why does Markdown to HTML insert <br/> on new lines?

Answer: I just do a find and delete on this text for now.

Temporarily disable caching

Question: I asked on StackOverflow about How to set cache=FALSE for a knitr markdown document and override code chunk settings?

Answer: Delete the cache folder. But there are other possible workflows.

Equivalent of Sexpr

Question: I asked on Stack Overvlow about whether there is an R Markdown equivalent to Sexpr in Sweave?.

Answer: Include the code between brackets of “bactick r space” and “backtick”. E.g., in the source code I have calculated 2 + 2 = 4 .

To leave a comment for the author, please follow the link and comment on his blog: Jeromy Anglim's Blog: Psychology and Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

garch() uncertainty

at by xi'an on R-Bloggers

(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

As part of an on-going paper with Kerrie Mengersen and Pierre Pudlo, we are using a GARCH(1,1) model as a target. Thus, the model is of the form

y_t=\sigma_t \epsilon_t \qquad \sigma^2_t = \alpha_0 + \alpha_1 y_{t-1}^2 + \beta_1 \sigma_{t-1}^2

which is a somehow puzzling object: the latent (variance) part is deterministic and can be reconstructed exactly given the series and the parameters. However, estimation is not such an easy task and using the garch() function in the tseries package leads to puzzling results! Indeed, simulating data shows some high variability of the procedure against starting values:

genedata=function(para,nobs){

   pata=epst=sigt=rnorm(nobs)
   sigt[1]=sqrt(para[1])
   pata[1]=epst[1]*sigt[1]
   for (t in 2:nobs){
      sigt[t]=sqrt(para[1]+para[2]*pata[t-1]^2+para[3]*sigt[t-1]^2)
      pata[t]=epst[t]*sigt[t]
      }
   list(pata=pata,sigt=sigt,epst=epst)
}
> x = genedata(c(1, 0.3, 0.2),1000)$pata
> garch(x,trace=FALSE)

Call:
garch(x = x, trace = FALSE)

Coefficient(s):
       a0         a1         b1
4.362e+00  1.976e-01  6.805e-14
> garch(x,trace=FALSE,start=c(1,.3,.2))

Call:
garch(x = x, trace = FALSE, start = c(1, 0.3, 0.2))

Coefficient(s):
    a0      a1      b1
0.8025  0.2592  0.3255
> simgarch=genedata(c(1, 0.2, 0.7),1000)

Call:
garch(x = simgarch$pat, trace = FALSE)

Coefficient(s):
a0         a1         b1
8.044e+00  1.826e-01  4.051e-14
> garch(simgarch$pat,trace=FALSE,star=c(1, 0.2, 0.7))

Call:
garch(x = simgarch$pat, trace = FALSE, star = c(1, 0.2, 0.7))

Coefficient(s):
a0      a1      b1
1.1814  0.2079  0.6590

The above code clearly shows the huge impact of the starting value on the final estimate….


Filed under: R, Statistics, University life Tagged: GARCH, R, times series, tseries

To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Revolution Analytics RHadoop appears to be useless as no documents can be found, looking at Oracle

at by caustic on R-Bloggers

(This article was first published on QUANTLABS.NET » R, and kindly contributed to R-bloggers)

Revolution Analytics RHadoop appears to be useless as no documents can be found, looking at Oracle This is somewhat of a joke. I thought this could be done fairly easily but nope, this company appears to be a bunch of space cadets in providing documents to set up their RHadoop package product., They push R but there is no proper documents on how to install RHadoop for us R newbies Everything has installed great thus far until this!  Pretty crappy I say if they want it go places. I found no documentation on how to accomplish this. As a result, I am dumping it and investigating Oracle's solutions. It looks like this Revolution Analytics could be a bunch of hacks unless they can convince me otherwise. More evidence of this: http://www.quora.com/How-can-R-and-Hadoop-be-used-together I thought RHive would work but that is promised to be slow as it is more analyzing data which Hive appears to do. ANother good potential option is RHIPE at http://www.rhipe.org/getpack.html

Get our FREE Open Source Historical Database by answering the 2 WORLD'S FASTEST TRADER/QUANT QUESTIONS

Post to Twitter

To leave a comment for the author, please follow the link and comment on his blog: QUANTLABS.NET » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Global Homicide Rates by Government Type

at by Patrick Rhodes on R-Bloggers

(This article was first published on Graph of the Week, and kindly contributed to R-bloggers)

world homicide rates by democracy index
Surprising results

For purposes of this article, any mention of homicide rates refers to reported homicide rates.

Open vs Closed

In mostly open countries (full democracies), the homicide rates are rather low when compared to other types of governments - except for authoritarian regimes. Left open to speculation, the reasons can be many. Perhaps people in free societies are happier - happy people don't tend to murder other people, otherwise they wouldn't be happy. In a previous articles, it was shown that full democracies produce longer life spans and are world leaders in technological advancements. So, not only do people thrive when free, but they live longer and pursue creativity.

In mostly closed countries (authoritarian regimes), the homicides rates are also low (comparatively speaking). It's very well possible that not all homicides are reported in these countries - especially given recent events regarding the revolts in the Arabian countries (see Arab Spring). Many thousands have died during this time which would not be reported as a 'homicide.' When the government murders a person, it is viewed as eliminating a criminal - not a homicide (in all countries, that is). But if the rates are somewhat accurate, then why so low? Perhaps it is due to fear - people who are afraid for their own lives probably will not kill others if that means their own death.



Who Reports the Most Homicides?

Flawed democracies.

These are the 2nd tier most 'free' countries. That begs the question - what exactly is a flawed democracy? It's not a simple answer - it's a rating based on a country's democracy index. In summary, it is based on five categories: electoral process and pluralism, civil liberties, the functioning of government, political participation and political culture. Click here for a full description.

Why is this? Perhaps those countries have unemployment problems, gender equality problems or culture problems. The question begs further research in this area and this author has not yet found a statistical correlation. If interested, the worst offenders include Honduras, Jamaica, El Savador, Trinidad and Tobago, South Africa, Brazil, and Columbia.

Anybody with insight into this area is welcome to comment. At any rate, if you are planning on visiting or moving to a country with a 'flawed democracy', you should keep the data presented here in mind.

global government category map
Too many authoritarian regimes

For a global map categorized by government type, see the chart above. Nearly 40% of the world is under authoritarian rule (much of that is China), while only 13% are under a full democracy. It would be interesting to see a similar map at one hundred year intervals going back in time.

A country can indeed make the transition from authoritarian regime to full democracy, but usually not without significant loss of life. Both Germany and Japan were authoritarian before and during World War II, but have since become very free countries. Hopefully this will occur in the aforementioned Arabian countries as well - time will tell.

Conclusion

This is the third article showing a relationship between freedom and its effects on people (the first two are here and here). Based on the data, this series of articles (strongly) suggests that people thrive more in free countries than oppressed ones. In this particular instance, there is a relationship between freedom and low homicide rates as well as a similar one for authoritarian regimes. All things being equal, this author chooses to live freely. Without that freedom, it wouldn't be possible to even find data for such an article as this much less publish it. Oppression is only beneficial to the oppressors.



Questions:
1) When looking at history, is there an increasing or decreasing trend in the number of full democracies?
2) How many homicides are not reported?
3) What would this map look like if all wrongful deaths were accurately reported?

Data:
Code:
These graphs were generated using the 'ggplot2' package within the R programming language. In the future, different plotting packages will begin to be seen here, including 3d plots and trellis plots. Stay tuned!

1st graph:

ggplot(subset(homicide.merge.frame, Source.Level.1=='Police'), aes(x=Index, y=Rate, group=Category, fill=Category)) +
geom_boxplot() +
ylab("Homicides per 100,000") +
xlab("Democracy Index") +
opts(title="Flawed Democracies Produce Hightest Homicide Rates",
legend.title = theme_blank(),
panel.background = theme_blank())
2nd graph:

ggplot(globe, aes(long, lat, group=group)) +

geom_polygon(data = dem.globe.index.frame, aes(x=long, y=lat, fill=Category, group=group), size=0.2) +
geom_polygon(colour="black", alpha=0.2) +

ylab("Latitude") +
xlab("Longitude") +

opts(title="Government Category Map",
legend.title = theme_blank(),
panel.background = theme_blank())

Further Reading:

To leave a comment for the author, please follow the link and comment on his blog: Graph of the Week.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

My take on polar bar (a.k.a. consultant’s) charts

at by Luca Fenu on R-Bloggers

(This article was first published on gRaphics!, and kindly contributed to R-bloggers)

Once upon a time, when I was working at Johnson & Johnson (pharma branch), I was surrounded by a bunch of programmers working to develop (among other things) a nifty piece of software for internal use. Part of it was later released as freeware, called Vlaaivis. The main idea was to visualize each the compound's many data at once, with each property being represented by a slice within the pie. For each property, ti was possible to define an ideal value, or range, and values above or below that one would show up as incomplete slices of two shades of the same color... This way, the fuller the pie the better, or more 'dieal', the compound under exam would be.
You may find it still at its home, http://www.vlaaivis.com/.

As I moved on to different things, I still kept that as a good way of visualising multifactorial data, especially when comparing several candidates (compounds, in my case). Inspired by a post on LearnR, I decided to start reimplementing something similar in R for use within our own group.

The most significant change from the LearnR post is that I added in a 'facets_grid()' call to the ggplot, so as to split the polar bar chart in the different compounds. Thanks to the faceting, all compounds are plotted on the same 'space' and are therefore immediately comparable.

The other, minor, change I made was to create the dataframe as a matrix, which is the way we usually store such kind of data, and use 'melt', from the package 'reshape', to convert it into a form suitable for plotting as bar chart (polar or not). I left in the comments an option for reading in the data from a text file...

Here's the dummy template I came up with:



And here's my output:

If, as I described previously, the bars were some kind of normalised score, such as the recently suggested druglikeness score, then the fuller the pie, the better looking the compound would be for a medicinal chemist.

I left the legend in for the moment, but in case of many compounds this may be omitted, since the variable names (a-e) is present in each plot (does anyone know how can I get rid of the 0.5 legend key? it comes from the alpha definition in the ggplot).

Two major things left to do:


  1. I would like to plot the compound in order of 'fullness' - a sort/order snippet is there in the code, and the new ordering survives the melting of the matrix - however, ggplot seems to rearrange the data according to some internal order... (any hints?)
  2. Right now, the code isn't suitable for too many compounds, since the facets_grid() will arrange them horizontally. I would be grateful if someone were to let me know how to automatically arrange them in a grid of a given maximum number of columns... I know how to do that when I explicitly create each plot, but then I loose the ease of comparison which comes from all compounds being plotted on the same scale...


I'll update this text and the code as I improve the visualization.

If you don't feel like messing around with code, you can always try and build a similar plot using deduceR:
http://www.r-statistics.com/2010/08/rose-plot-using-deducers-ggplot2-plot-builder/

Hope one or more of you find this useful!

To leave a comment for the author, please follow the link and comment on his blog: gRaphics!.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

RProtoBuf 0.2.4

at by Thinking inside the box on R-Bloggers

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release 0.2.4 of RProtoBuf is now on CRAN. RProtoBuf provides GNU R bindings for the Google Protobuf data encoding library used and released by Google.

This release once again contains a number of patches kindly contributed by Murray Stokely, as well as an added header file needed to build with the g++ 4.7 version which has become the build standard on CRAN.

The NEWS file entry follows below:

0.2.4   2012-05-15

    o   Applied several patches kindly supplied by Murray Stokely to
         - properly work with repeated strings 
         - correct C++ function naming in a few instances
         - add an example of ascii export/import of messages

    o   Suppport g++-4.7 and stricter #include file checking by adding unistd

    o   Made small improvements to the startup code

CRANberries also provides a diff to the previous release 0.2.3. More information is at the RProtoBuf page which has a draft package vignette, a 'quick' overview vignette and a unit test summary vignette. Questions, comments etc should go to the rprotobuf mailing list off the RProtoBuf page at R-Forge. Updated to show NEWS rather than ChangeLog

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Finding Waldo, a flag on the moon and multiple choice tests, with R

at by arthur charpentier on R-Bloggers

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

I have to admit, first, that finding Waldo has been a difficult task. And I did not succeed. Neither could I correctly spot his shirt (because actually, it was what I was looking for). You know, that red-and-white striped shirt. I guess it should have been possible to look for Waldo's face (assuming that his face does not change) but I still have problems with size factor (and resolution issues too). The problem is not that simple. At the http://mlsp2009.conwiz.dk/ conference, a price was offered for writing an algorithm in Matlab. And one can even find Mathematica codes online. But most of the those algorithms are based on the idea that we look for similarities with Waldo's face, as described in problem 3 on http://www1.cs.columbia.edu/~blake/'s webpage. You can find papers on that problem, e.g. Friencly & Kwan (2009) (based on statistical techniques, but Waldo is here a pretext to discuss other issues actually), or more recently (but more complex) Garg et al. (2011) on matching people in images of crowds.

What about codes in R ? On http://stackoverflow.com/, some ideas can be found (and thank Robert Hijmans for his help on his package). So let us try here to do something, on our own. Consider the following picture,

With the following code (based on the following file) it is possible to import the picture, and to extract the colors (based on an RGB decomposition),
> library(raster)
> waldo=brick(system.file("DepartmentStoreW.grd",
+ package="raster"))
> waldo
class       : RasterBrick
dimensions  : 768, 1024, 786432, 3 (nrow,ncol,ncell,nlayer)
resolution  : 1, 1  (x, y)
extent      : 0, 1024, 0, 768  (xmin, xmax, ymin, ymax)
coord. ref. : NA
values      : C:\R\win-library\raster\DepartmentStoreW.grd
min values  : 0 0 0
max values  : 255 255 255
My strategy is simple: try to spot areas with white and red stripes (horizontal stripes). Note that here, I ran the code on a Windows machine, the package is not working well on Mac. In order to get a better understanding of what could be done, let us start with something much more simple. Like the picture below, with Waldo (and Waldo only). Here, it is possible to extract the three colors (red, green and blue),
> plot(waldo,useRaster=FALSE)

     

It is possible to extract the red zones (already on the graph above, since red is a primary color), as well as the white ones (green zones on the graphs means a white region on the picture, on the left)
# white component
white = min(waldo[[1]] , waldo[[2]] , waldo[[3]])>220
focalswhite = focal(white, w=3, fun=mean)
plot(focalswhite,useRaster=FALSE)
 
# red component
red = (waldo[[1]]>150)&(max(  waldo[[2]] , waldo[[3]])<90)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE)
i.e. here we have the graphs below, with the white regions, and the red ones,

From those two parts, it has been possible to extract the red-and-white stripes from the picture, i.e. some regions that were red above, and white below (or the reverse),
# striped component
striped = red; n=length(values(striped)); h=5
values(striped)=0
values(striped)[(h+1):(n-h)]=(values(red)[1:(n-2*h)]==
TRUE)&(values(red)[(2*h+1):n]==TRUE)
focalsstriped = focal(striped, w=3, fun=mean)
plot(focalsstriped,useRaster=FALSE)
So here, we can easily spot Waldo, i.e. the guy with the red-white stripes (with two different sets of thresholds for the RGB decomposition)

 Let us try somthing slightly more complicated, with a zoom on the large picture of the department store (since, to be honest, I know where Waldo is...).

Here again, we can spot the white part (on the left) and the red one (on the right), with some thresholds for the RGB decomposition

Note that we can try to be (much) more selective, playing with threshold. Here, it is not very convincing: I cannot clearly identify the region where Waldo might be (the two graphs below were obtained playing with thresholds)

 And if we look at the overall pictures, it is worst. Here are the white zones, and the red ones,

and again, playing with RGB thresholds, I cannot spot Waldo,

Maybe I was a bit optimistic, or ambitious. Let us try something more simple, like finding a flag on the moon. Consider the picture below on the left, and let us see if we can spot an American flag,

Again, on the left, let us identify white areas, and on the right, red ones

Then as before, let us look for horizontal stripes

Waouh, I did it ! That's one small step for man, one giant leap for R-coders ! Or least for me... So, why might it be interesting to identify areas on pictures ? I mean, I am not Chloe O'Brian, I don't have to spot flags in a crowd, neither Waldo, nor some terrorists (that might wear striped shirts). This might be fun if you want to give grades for your exams automatically. Consider the two following scans, the template, and a filled copy,

A first step is to identify regions where we expect to find some "red" part (I assume here that students have to use a red pencil). Let us start to check on the template and the filled form if we can identify red areas,
exam = stack("C:\\Users\\exam-blank.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE) 
exam = stack("C:\\Users\\exam-filled.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE)


First, we have to identify areas where students have to fill the blanks. So in the template, identify black boxes, and get the coordinates (here manually)
exam = stack("C:\\Users\\exam-blank.png")
black = max(  exam[[1]] ,exam[[2]] , exam[[3]])<50
focalsblack = focal(black, w=3, fun=mean)
plot(focalsblack,useRaster=FALSE)
correct=locator(20)
coordinates=locator(20)
X1=c(73,115,156,199,239)
X2=c(386,428.9,471,510,554)
Y=c(601,536,470,405,341,276,210,145,79,15)
LISTX=c(rep(X1,each=10),rep(X2,each=10))
LISTY=rep(Y,10)
points(LISTX,LISTY,pch=16,col="blue")

The blue points above are where we look for students' answers. Then, we have to define the vector of correct answers,
CORRECTX=c(X1[c(2,4,1,3,1,1,4,5,2,2)],
X2[c(2,3,4,2,1,1,1,2,5,5)])
CORRECTY=c(Y,Y)
points(CORRECTX, CORRECTY,pch=16,col="red",cex=1.3)
UNCORRECTX=c(X1[rep(1:5,10)[-(c(2,4,1,3,1,1,4,5,2,2)
+seq(0,length=10,by=5))]],
X2[rep(1:5,10)[-(c(2,3,4,2,1,1,1,2,5,5)
+seq(0,length=10,by=5))]])
UNCORRECTY=c(rep(Y,each=4),rep(Y,each=4))
Now, let us get back on red areas in the form filled by the student, identified earlier,
exam = stack("C:\\Users\\exam-filled.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=5, fun=mean)

Here, we simply have to compare what the student answered with areas where we expect to find some red in,
ind=which(values(focalsred)>.3)
yind=750-trunc(ind/610)
xind=ind-trunc(ind/610)*610
points(xind,yind,pch=19,cex=.4,col="blue")
points(CORRECTX, CORRECTY,pch=1,
col="red",cex=1.5,lwd=1.5)
Crosses on the graph on the right below are the answers identified as correct (here 13),
> icorrect=values(red)[(750-CORRECTY)*
+ 610+(CORRECTX)]
> points(CORRECTX[icorrect], CORRECTY[icorrect],
+ pch=4,col="black",cex=1.5,lwd=1.5)
> sum(icorrect)
[1] 13

In the case there are negative points for non-correct answer, we can count how many incorrect answers we had. Here 4.
> iuncorrect=values(red)[(750-UNCORRECTY)*610+
+ (UNCORRECTX)]
> sum(iuncorrect)
[1] 4
So I have not been able to find Waldo, but I least, that will probably save me hours next time I have to mark exams...

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Revolution Newsletter: May 2012

at by David Smith on Revolutions

The most recent edition of the Revolution Newsletter is out. The news section is below, and you can read the full May edition (with highlights from this blog and community events) online. You can subscribe to the Revolution Newsletter to get it monthly via email.

New R Training Courses Announced. Three new R courses from leading R experts are now available for registration:

  • An Introduction to R for SAS, SPSS, and Stata Users will be presented by Bob Muenchen (author of R for SAS and SPSS users) June 26-29. This is an on-line workshop with live instruction that you can attend from your desk - no travel necessary!
  • Visualization in R with ggplot2 is also a new web-based course with live instruction from Garrett Grolemund & Dr. Winston Chang of Rice University, presented online June 19-20.
  • R Development Master Class will be presented in-person by R package author and professor Hadley Wickham, June 21-22 in NYC and June 28-29 in Redwood City, CA.

Sign up to these courses now to secure your seat -- attendance is limited!

On the hunt for a cure for MS. Researchers at SUNY Buffalo are using big-data analytics and generic data toresearch a cure for Multiple Sclerosis. Read in Forbes how Revolution R and IBM Netezza speeds up their research, or join the #IBMDataChat discussion on Twitter at noon ET on Thursday, May 10 to learn more.

More free R webinars. Revolution Analytics' Spring webinar series continues, with upcoming presentations on Getting Up to Speed with R,  spatial statistics with Rdata mining with R, Achieving High-Performing, Simulation-Based Operational Risk Measurement with RevoScaleR, and a look at the next version of Revolution R Enterprise.

Catch our archived webinar replays. Missed any of the recent presentations on Big Data with R and Hadoopintegrating R with BI tools and MS Office, or How Big Data is Changing Retail Marketing Analytics? Check our webinar archives for streaming replays and presentation downloads. 

Putting the R in Analytics. feature article in Information Age reveals how new graduates are driving adoption of R in industry.

Upcoming Conferences for R Users. Revolution Analytics is proud to sponsor: R/Finance 2012 (the premier conference for R users in the finance industry) on May 11-12 in Chicago; and the Interface 2012 conference on the future of Statistical Computing, May 16-18 in Houston.

R user conference deadlines approaching. If you're planning to attend useR! 2012 in Nashville but want to avoid late registration fees, the deadline for regular registration is May 12. 

Revolution Analytics: Newsletter Archive

Revolution Newsletter: May 2012

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The most recent edition of the Revolution Newsletter is out. The news section is below, and you can read the full May edition (with highlights from this blog and community events) online. You can subscribe to the Revolution Newsletter to get it monthly via email.

New R Training Courses Announced. Three new R courses from leading R experts are now available for registration:

  • An Introduction to R for SAS, SPSS, and Stata Users will be presented by Bob Muenchen (author of R for SAS and SPSS users) June 26-29. This is an on-line workshop with live instruction that you can attend from your desk - no travel necessary!
  • Visualization in R with ggplot2 is also a new web-based course with live instruction from Garrett Grolemund & Dr. Winston Chang of Rice University, presented online June 19-20.
  • R Development Master Class will be presented in-person by R package author and professor Hadley Wickham, June 21-22 in NYC and June 28-29 in Redwood City, CA.

Sign up to these courses now to secure your seat -- attendance is limited!

On the hunt for a cure for MS. Researchers at SUNY Buffalo are using big-data analytics and generic data toresearch a cure for Multiple Sclerosis. Read in Forbes how Revolution R and IBM Netezza speeds up their research, or join the #IBMDataChat discussion on Twitter at noon ET on Thursday, May 10 to learn more.

More free R webinars. Revolution Analytics' Spring webinar series continues, with upcoming presentations on Getting Up to Speed with R,  spatial statistics with Rdata mining with R, Achieving High-Performing, Simulation-Based Operational Risk Measurement with RevoScaleR, and a look at the next version of Revolution R Enterprise.

Catch our archived webinar replays. Missed any of the recent presentations on Big Data with R and Hadoopintegrating R with BI tools and MS Office, or How Big Data is Changing Retail Marketing Analytics? Check our webinar archives for streaming replays and presentation downloads. 

Putting the R in Analytics. feature article in Information Age reveals how new graduates are driving adoption of R in industry.

Upcoming Conferences for R Users. Revolution Analytics is proud to sponsor: R/Finance 2012 (the premier conference for R users in the finance industry) on May 11-12 in Chicago; and the Interface 2012 conference on the future of Statistical Computing, May 16-18 in Houston.

R user conference deadlines approaching. If you're planning to attend useR! 2012 in Nashville but want to avoid late registration fees, the deadline for regular registration is May 12. 

Revolution Analytics: Newsletter Archive

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

And the Winner is……

at by Dave Giles on R-Bloggers

(This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)

R will overtake SAS and SPSS in 2015 - according to David Smith in his post on the Revolutions blog.

I can believe that!

© 2012, David E. Giles

To leave a comment for the author, please follow the link and comment on his blog: Econometrics Beat: Dave Giles' Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Stepping Outside My Open-Source Comfort Zone: A First Look at Golden Helix SVS

at by Stephen Turner on R-Bloggers

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

I'm a huge supporter of the Free and Open Source Software movement. I've written more about R than anything else on this blog, all the code I post here is free and open-source, and a while back I invited you to steal this blog under a cc-by-sa license.

Every now and then, however, something comes along that just might be worth paying for. As a director of a bioinformatics core with a very small staff, I spend a lot of time balancing costs like software licensing versus personnel/development time, so that I can continue to provide a fiscally sustainable high-quality service.

As you've likely noticed from my more recent blog/twitter posts, the core has been doing a lot of gene expression and RNA-seq work. But recently had a client who wanted to run a fairly standard case-control GWAS analysis on a dataset from dbGaP. Since this isn't the focus of my core's service, I didn't want to invest the personnel time in deploying a GWAS analysis pipeline, downloading and compiling all the tools I would normally use if I were doing this routinely, and spending hours on forums trying to remember what to do with procedural issues such as which options to specify when running EIGENSTRAT or KING, or trying to remember how to subset and LD-prune a binary PED file, or scientific issues, such as whether GWAS data should be LD-pruned at all before doing PCA.

Golden Helix

A year ago I wrote a post about the "Hitchhiker's Guide to Next-Gen Sequencing" by Gabe Rudy, a scientist at Golden Helix. After reading this and looking through other posts on their blog, I'm confident that these guys know what they're doing and it would be worth giving their product a try. Luckily, I had the opportunity to try out their SNP & Variation Suite (SVS) software (I believe you can also get a free trial on their website).

I'm not going to talk about the software - that's for a future post if the core continues to get any more GWAS analysis projects. In summary - it was fairly painless to learn a new interface, import the data, do some basic QA/QC, run a PCA-adjusted logistic regression, and produce some nice visualization. What I want to highlight here is the level of support and documentation you get with SVS.

Documentation

First, the documentation. At each step from data import through analysis and visualization there's a help button that opens up the manual at the page you need. This contextual manual not only gives operational details about where you click or which options to check, but also gives scientific explanations of why you might use certain procedures in certain scenarios. Here's a small excerpt of the context-specific help menu that appeared when I asked for help doing PCA.

What I really want to draw your attention to here is that even if you don't use SVS you can still view their manual online without registering, giving them your e-mail, or downloading any trialware. Think of this manual as an always up-to-date mega-review of GWAS - with it you can learn quite a bit about GWAS analysis, quality control, and statistics. For example, see this section on haplotype frequency estimation and the EM algorithm. The section on the mathematical motivation and implementation of the Eigenstrat PCA method explains the method perhaps better than the Eigenstrat paper and documentation itself. There are also lots of video tutorials that are helpful, even if you're not using SVS. This is a great resource, whether you're just trying to get a better grip on what PLINK is doing, or perhaps implementing some of these methods in your own software.

Support

Next, the support. After installing SVS on both my Mac laptop and the Linux box where I do my heavy lifting, one of the product specialists at Golden Helix called me and walked me through every step of a GWAS analysis, from QC to analysis to visualization. While analyzing the dbGaP data for my client I ran into both software-specific procedural issues as well as general scientific questions. If you've ever asked a basic question on the R-help mailing list, you know need some patience and a thick skin for all the RTFM responses you'll get. I was able to call the fine folks at Golden Helix and get both my technical and scientific questions answered in the same day. There are lots of resources for getting your questions answered, such as SEQanswers, Biostar, Cross Validated, and StackOverflow to name a few, but getting a forum response two days later from "SeqGeek96815" doesn't compare to having a team of scientists, statisticians, programmers, and product specialists on the other end of a telephone whose job it is to answer your questions.

Final Thoughts

This isn't meant to be a wholesale endorsement of Golden Helix or any other particular software company - I only wanted to share my experience stepping outside my comfortable open-source world into the walled garden of a commercially-licensed software from a for-profit company (the walls on the SVS garden aren't that high in reality - you can import and export data in any format imaginable). One of the nice things about command-line based tools is that it's relatively easy to automate a simple (or at least well-documented) process with tools like Galaxy, Taverna, or even by wrapping them with perl or bash scripts. However, the types of data my clients are collecting and the kinds of questions they're asking are always a little new and different, which means I'm rarely doing the same exact analysis twice. Because of the level of documentation and support provided to me, I was able to learn a new interface to a set of familiar procedures and run an analysis very quickly and without spending hours on forums figuring out why a particular program is seg-faulting. Will I abandon open-source tools like PLINK for SVS, Tophat-Cufflinks for CLC Workbench, BWA for NovoAlign, or R for Stata? Not in a million years. I haven't talked to Golden Helix or some of the above-mentioned companies about pricing for their products, but if I can spend a few bucks and save the time it would taken a full time technician at $50k+/year to write a new short read aligner or build a new SNP annotation database server, then I'll be able to provide a faster, high-quality, fiscally sustainable service at a much lower price for the core's clients, which is all-important in a time when federal funding is increasingly harder to come by.

To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Using R to graph a subject trend in PubMed

at by David Ruau on R-Bloggers

(This article was first published on Brain Chronicle, and kindly contributed to R-bloggers)

The traditional way to show that your topic is worth studying in front of an audience is to show the state of the field based on a literature review. This is especially true if your subject is obscure except to a handful of scientists in the world.
I was confronted with this problem more than once and the last time I decided to plot the state-of-the-field using a few scripts.
I wrote three scripts for that: pubmed_trend.r that take your PubMed query and send it to the NCBI using the Eutils tools (Perl script). Then I plot the results. The details of the scripts is below but here how you create your trend.
Here we plot the trend for the number of publications per year for papers annotated with MeSH terms for "sex characteristics" and "pain" and compare this search to the number of publication/year for "sex characteristics" and "Analgesics". We will run this search between 1970 and 2011. And here is the plot.
What we see here is that the number of publications per year talking about sex difference and pain or analgesics is growing but the number of publication per year is still small and more research is needed.
...and you are good to go, your talk is launch

Here is the details of the scripts and functions. The pubmed_trend.r take a query string as you would type it in the search box through the web interface (space have to be replaced by '+').
The Perl script is straight forward and return an XML file that is parsed by the XML library in R. And here is the plot function using barplot.

To leave a comment for the author, please follow the link and comment on his blog: Brain Chronicle.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

How long before R overtakes SAS and SPSS?

at by David Smith on Revolutions

Based on an analysis of Google Scholar data on usage of statistical software, Bob Muenchen makes a forecast: R will overtake SAS and SPSS in 2015. Forecasting is extrapolation — always a tricky business — so Bob also provides these qualitative reasons why R will continue to grow at the expense of SAS and SPSS:

  • The continued rapid growth in add-on packages (Figure 10)
  • The attraction of R’s powerful language
  • The near monopoly R has on the latest analytic methods
  • Its free price
  • The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (it benefits those organizations, so the vendors say they should have their own software license).

See how Bob comes up with this forecast (using R, of course!) at the link below.

r4stats.com: Will 2015 be the Beginning of the End for SAS and SPSS?

How long before R overtakes SAS and SPSS?

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Based on an analysis of Google Scholar data on usage of statistical software, Bob Muenchen makes a forecast: R will overtake SAS and SPSS in 2015. Forecasting is extrapolation — always a tricky business — so Bob also provides these qualitative reasons why R will continue to grow at the expense of SAS and SPSS:

  • The continued rapid growth in add-on packages (Figure 10)
  • The attraction of R’s powerful language
  • The near monopoly R has on the latest analytic methods
  • Its free price
  • The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (it benefits those organizations, so the vendors say they should have their own software license).

See how Bob comes up with this forecast (using R, of course!) at the link below.

r4stats.com: Will 2015 be the Beginning of the End for SAS and SPSS?

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Interactive reports in R with knitr and RStudio

at by Markus Gesmann on R-Bloggers

(This article was first published on mages' blog, and kindly contributed to R-bloggers)

Last Saturday I met the guys from RStudio at the R in Finance conference in Chicago. I was curious to find out what RStudio could offer. In the past I have used mostly Emacs + ESS for editing R files. Well, and what a surprise it was. JJ, Joe and Josh showed me a preview of version 0.96 of their software, which adds a close integration of Sweave and knitr to RStudio, helping to create dynamic web reports with the new R Markdown and R HTML formats more easily.

Screen shot of RStudio with a knitr file (*.Rmd) in the top left window.
Notice also the integrated knitr button.

You probably have come across Sweave in the past, but knitr is a fairly new package by Yihui Xie that brings literate programming to a new level. In particular the markdown approach allows me to create web content really quickly, without worrying to much about layout and R formatting. I begin to wonder if PDF and paper will be replaced by tablets and HTML5 in the future.

Here is a simple example. The knitr source code is available on Github.

Loading ...

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Will 2015 be the Beginning of the End for SAS and SPSS?

at by r4stats on R-Bloggers

(This article was first published on r4stats.com » R, and kindly contributed to R-bloggers)

Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. I track this trend, and many others, in my article The Popularity of Data Analysis Software. In the latest update (4/13/2012) I forecast that, if current trends continued, the use of the R software would exceed that of SAS for scholarly applications in 2015. That was based on the data shown in Figure 7a, which I repeat here:

Let’s take a more detailed look at what the future may hold for R, SAS and SPSS Statistics.

Here is the data from Google Scholar:

         R   SAS   SPSS
1995     8  8620   6450
1996     2  8670   7600
1997     6 10100   9930
1998    13 10900  14300
1999    26 12500  24300
2000    51 16800  42300
2001   133 22700  68400
2002   286 28100  88400
2003   627 40300  78600
2004  1180 51400 137000
2005  2180 58500 147000
2006  3430 64400 142000
2007  5060 62700 131000
2008  6960 59800 116000
2009  9220 52800  61400
2010 11300 43000  44500
2011 14600 32100  32000

ARIMA Forecasting

We can forecast the use of R using Rob Hyndman’s handy auto.arima function to forecast five years into the future:

> library("forecast")

> R_fit <- auto.arima(R)

> R_forecast <- forecast(R_fit, h=5)

> R_forecast

   Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
18          18258 17840 18676 17618 18898
19          22259 21245 23273 20709 23809
20          26589 24768 28409 23805 29373
21          31233 28393 34074 26889 35578
22          36180 32102 40258 29943 42417

We see that even if the use of SAS and SPSS were to remain at their current levels, R use would surpass their use in 2016 (Point Forecast column where 18-22 represent years 2012 -2016).

If we follow the same steps for SAS we get:

> SAS_fit <- auto.arima(SAS)

> SAS_forecast <- forecast(SAS_fit, h=5)

> SAS_forecast

   Point Forecast     Lo 80   Hi 80    Lo 95 Hi 95
18          21200  16975.53 25424.5  14739.2 27661
19          10300    853.79 19746.2  -4146.7 24747
20           -600 -16406.54 15206.5 -24774.0 23574
21         -11500 -34638.40 11638.4 -46887.1 23887
22         -22400 -53729.54  8929.5 -70314.4 25514

It appears that if the use of SAS continues to decline at its precipitous rate, all scholarly use of it will stop in 2014 (the number of articles published can’t be less than zero, so view the negatives as zero). I would bet Mitt Romney $10,000 that that is not going to happen!

I find the SPSS prediction the most interesting:

> SPSS_fit <- auto.arima(SPSS)

> SPSS_forecast <- forecast(SPSS_fit, h=5)

> SPSS_forecast

   Point Forecast   Lo 80 Hi 80   Lo 95  Hi 95
18        13653.2  -16301 43607  -32157  59463
19        -4693.6  -57399 48011  -85299  75912
20       -23040.4 -100510 54429 -141520  95439
21       -41387.2 -145925 63151 -201264 118490
22       -59734.0 -193590 74122 -264449 144981

The forecast has taken a logical approach of focusing on the steeper decline from 2005 through 2010 and predicting that this year (2012) is the last time SPSS will see use in scholarly publications. However the part of the graph that I find most interesting is the shift from 2010 to 2011, which shows SPSS use still declining but at a much slower rate.

Any forecasting book will warn you of the dangers of looking too far beyond the data and I think these forecasts do just that. The 2015 figure in the Popularity paper and in the title of this blog post came from an exponential smoothing approach that did not match the rate of acceleration as well as the ARIMA approach does.

Colbert Forecasting

While ARIMA forecasting has an impressive mathematical foundation it’s always fun to follow Stephen Colbert’s approach: go from the gut. So now I’ll present the future of analytics software that must be true, because it feels so right to me personally. This analysis has Colbert’s most important attribute: truthiness.

The growth in R’s use in scholarly work will continue for two more years at which point it will level off at around 25,000 articles in 2014.This growth will be driven by:

  • The continued rapid growth in add-on packages (Figure 10)
  • The attraction of R’s powerful language
  • The near monopoly R has on the latest analytic methods
  • Its free price
  • The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (it benefits those organizations, so the vendors say they should have their own software license).

What will slow R’s growth is its lack of a graphical user interface that:

  • Is powerful
  • Is easy to use
  • Provides journal style output in word processor format
  • Is standard, i.e. widely accepted as The One to Use
  • Is open source

While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its capabilities. Regardless of which is best, GUI users far outnumber programmers and, until resolved, this will limit R’s long term growth. There are GUIs for R, but so many to choose from that none becomes the clear leader (Deducer, R Commander, Rattle, Red-R, at least two from commercial companies and still more here.) If from this “GUI chaos” a clear leader were to emerge, then R could continue its rapid growth and end up as the most used package.

The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata) but also by SAS Instute’s self-inflicted GUI chaos.  For years they have offered too many GUIs such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and  even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use could not decide what to teach so they continued teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a good GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.

The use of SPSS for scholarly work will decline only slightly this year and will level off in 2013 because:

  • The people who needed advanced methods and were not happy calling R functions from within SPSS have already switched to R or Stata
  • The people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
  • The people who needed a more advanced GUI have already switched to JMP

The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At The University of Tennessee where I work, that’s the great majority of SPSS users.

Stata’s growth will level off in 2013 at level that will leave it in fourth place. The other packages shown in Figure 7b will also level off around the same time, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata for fourth place.

The future of Enterprise Miner and SPSS Modeler are tied to the success of each company’s more mainstream products, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes.

So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to follow the detailed blog at Librestats to collect your own data from Google Scholar and do your own set of forecasts. Or simply go from the gut!


To leave a comment for the author, please follow the link and comment on his blog: r4stats.com » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Forthcoming R User Meetings

at by Tal Galili on R-Bloggers

Two R User Group meetings are happening soon thanks to the support of Mango-solutions (one of R-bloggers’ long term sponsors).  Details below:

 

1)      ZurichR – Wednesday 23rd May 2012 (www.zurichr.org)

ZurichR is a free networking event for R users sponsored by Mango Solutions and ETH Zurich

All welcome to attend.  Please confirm attendance in advance to zurichr@mango-solutions.com

Time:       6.30pm – 9.30 pm

Place:      ETH Zurich, Room HG F 50.3, Rämistrasse 101, 8092 Zurich

Presentations:

  • ·         R with COM; the dark side of .NET – John James, Mango Solutions
  • ·         Stability Analytics of Financial Time Series – Diethelm Wuertz, ETH
  • ·         Transparent Investment Management – Simon Otziger, Corepoint Capital

 

2)      LondonR – 19th June 2012 (www.londonr.org)

 

LondonR is a free networking event for R users sponsored by Mango Solutions

All welcome to attend.  Please confirm attendance in advance to londonr@mango-solutions.com

Venue:  The Counting House, 50 Cornhill, London, London EC3V 3PD (the meeting room is upstairs)

Time:     6pm – 9.30 pm

Presentations:

  • ·         Writing R for Dummies – Andrie De Vries
  • ·         Dynamical Systems in R with simecol – Markus Gesmann
  • ·         News from data.table 1.6, 1.7 and 1.8 – Matthew Dowle
  • ·         Converting S Plus Applications to R – Andy Nicholls

 

 

 

 

Skew of Bonds

at by klr on R-Bloggers

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

As the researchpuzzler highlights in “a bad bet”, US bonds were a popular subject at the CFA Institute Annual Conference.  While US Bonds have been in an amazing 30 year run (see previous posts Lattice Explore Bonds, Bond Market as a Casino Game Part 1, Calmar Ratio 1.37 over the past 20 years), I think many positive skew-chasing market participants are not aware of the frequency of negative skew in bond returns.  As a public service, I thought I should issue a negative skew alert.

From TimelyPortfolio

R code from GIST:

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Improving script_002: “Monitor”

at by jrcuesta on R-Bloggers

(This article was first published on NIR-Quimiometría, and kindly contributed to R-bloggers)

I read in an article that Ian Cowe said that what normally chemometricians do is to look to the graphics, of course interpret those graphics. So I still go on trying to develop a function can help me to understand the graphics and all the statistics there are behind.
I add some more lines to the monitor function:
plot(x~y,main="X-Y plot",xlab="predicted",ylab="reference")
abline(0,1,col="blue")
abline(intercept,slope,col="red")
abline(intercept+(2.5*sep),slope,col="red",lty=4)
abline(intercept-(2.5*sep),slope,col="red",lty=4)
I can do the same for the residual plot.
There are two warning lines which advice if the residual exceeds 2,5*SEP value. That is the T value warning limit.
Another line can be add, called the T value action limit (3*SEP).
Graphics show the 0-1 abline (blue) and the calculated slope-intercept abline (red). Limits are with dashed red lines.
We can see that almost both lines red and blue are almost over-plotted in this case.

To leave a comment for the author, please follow the link and comment on his blog: NIR-Quimiometría.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

RcppSMC 0.1.1

at by Thinking inside the box on R-Bloggers

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

CRAN now tests packages against g++-4.7 (as this version has become the default on Debian's testing variant. This compiler switch once again triggered a set of build failures, mostly from include files now deemed missing. For RcppSMC, it came down to a five-character patch of explicitly stating one max() call as std::max()

No other changes were made at this point. The NEWS entry is below:

0.1.1   2012-05-14

    o   Version 0.1.1 

    o   Minor g++-4.7 build fix of using std::max() explicitly

Courtesy of CRANberries, there is also a diffstat report for 0.1.1 relative to 0.1.0 As always, more detailed information is on the RcppSMC page,

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

GitHub data analysis

at by Dzidorius Martinaitis on R-Bloggers

(This article was first published on Quantitative thoughts » EN, and kindly contributed to R-bloggers)

Few weeks ago GitHub announced, that its timeline data is available on bigquery for analysis. Moreover, it offers prizes for the best visualization of the data. Despite my art skills and minimal chances to win beauty contest, I decided to crunch GitHub data and run data analysis.

After initial trial of bigquery service, I found hard to know, what price, if any, I’m going to pay for the service. Hence, I pulled the data (6.5 GB) from bigquery on my machine and further I used my machine for analysis. Bash scripts have been used to clean up and extract necessary data, R for data analysis and visualization and C++ for text extraction.

GitHub dataset is one table, where each row consist of information about repository (i.e. path, date of creation, name, description, programming language, number of forks/watchers and etc.) and action, which was done by user (i.e. username, location, timestamp and etc.).

As a result, we can check how GitHub users actions are spread over time during the day. The X axis on the graph below is labeled with the hours of the day (GMT) and the Y axis represent median values of the actions for each hour. From it, we can make a deduction, that highest load for GitHub can be expected between 15:00 and 17:00 GMT and lowest to be expected between 05:00 and 07:00 GMT. The color of the line indicates how busy was the day based on quantiles: green are calm days (20% of days), blue – normal days (50% quantile) and red are busy days (80% quantile). I should to mention, that auto-correlation or serial correlation is high (70% for following hour), which means, that busy hours tend to be followed by busy hours and calm hours tend to be followed by calm hours. Moreover, busy days tend happen after busy days.

Photobucket

Second graph below shows median of actions divided by weekdays. There is not big surprise – weekends are more slow than weekdays, nevertheless the programmers are slightly less productive on Mondays and Fridays.

Photobucket

The analysis of creation of new repository shows, that the pattern of busy or calm hours remains over the years. This can be attributed to the fact, that majority of the users comes from North America and Europe.
Another hypothesis can be drawn from this information, that number of creation of the new repositories grow exponentially. However, I mind you, that the graph below is biased – most likely, GitHub users update recent projects, consequently more recent projects appeared on timeline. Even though, 2009-2011 years show exponential grow.
The X axis of the graph below is labeled with the hour of the day, the Y axis – log of median values of new repositories.

Photobucket

Following graph shows the number of forks per project (the X axis, log scale) versus number of watchers (the Y axis, log scale). As expected, there is linear correlation between forks and watchers. Even so there is something interesting about outliers, which are below bottom line – the projects, where number of watchers is low, but number of forks is high. These are anomalies and worth to check.

Photobucket

The next thing to do is to look at the repository description. Let’s group the repositories by programming language and count most dominant words in the description. The graph below has C++ word cloud on the left and Java – right . C++ projects are about library, game, simple(?), engine, Arduino. Java is dominated by android, plugin, server, minecraft, spring, maven.

Photobucket
Ruby (left) vs Python(right ):
Photobucket

“Surprise”, “surprise” – R projects (left) are largely about data analysis, however “machine” word, which corresponds to Machine learning is very tiny. Shell (right) is dominated by configuration, managing, git(?).

Photobucket

GitHub dataset includes location field. Unfortunately, the users can enter whatever they want – country, city or leave it empty. Nevertheless, I was able to extract good chunk of actions, where location field has meaningful value.  The video below shows country based users activity, where dark red corresponds to high activity and light red – minor. Only 30 most active countries are included, the rest are grey.
The same pattern persist over the days – activity in Asia increases around midnight, Europe wakes up around 8:00 or 9:00, where America starts around 15:00. Who said, that hackers and programmers work at night?

 

What else can be done with GitHub dataset? Most repositories have description field, which can be used to find similar projects by implementing tf-idf method. I tried that method and the results are satisfying.

Most of the graphs shown above are reproducible (except word clouds) and the code can be found on GitHub.

To leave a comment for the author, please follow the link and comment on his blog: Quantitative thoughts » EN.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Blog aggregators

at by Rob J Hyndman on R-Bloggers

(This article was first published on Research tips » R, and kindly contributed to R-bloggers)

A very useful way of keeping up with blogs in a particular area is to subscribe to a blog aggregator. These will syndicate posts from a large number of blogs and provide links back to the original sources. So you only need to subscribe once to get all the good stuff in that area.

There are now several blog aggregators available that might be of interest to readers here. And this blog is now syndicated on several other sites including those listed below.

  • R-bloggers: for all R-related blogs. The posts tagged R from this blog are syndicated there along with about 300 other R blogs.
  • Statsblogs for statistical blogs. This is a very new aggregator, but is growing fast. There is naturally some overlap with R-bloggers. All posts from this blog are syndicated there.
  • TeX community for TeX related blogs. The posts tagged LaTeX from this blog are syndicated there along with about 40 other TeX blogs.
  • Mathblogging.org aggregates a number of mathematics blogs. All posts from this blog are syndicated there.

In addition, for those interested in economics, EconAcademics.org aggregates a number of economics blogs (but does not include this blog as I rarely post anything about economics).

I have also set up two aggregations for new research papers:

So if all the blogs around are overwhelming, and you don’t know where to start, select one or two of these aggregators and you’re off and running.

If you’ve no idea how to subscribe to a blog, see my post on using Google Reader.

To leave a comment for the author, please follow the link and comment on his blog: Research tips » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Multiple Sclerosis Tweet-Chat: Review

at by David Smith on Revolutions

We had a great Twitter conversation last Thursday on the use of big-data analytics, Revolution R Enterprise, and IBM Netezza in the search for a cure for MS. Many thanks to the other panelists: Murali Ramanathan (SUNY Buffalo), Tim Coetzee (National MS Society) and moderator Shawn Dolley (IBM) for fielding and answering questions from interested parties following #IBMDataChat. As you can see from this twitteR analysis, it was a lively discussion, with more than 300 tweets during the designated hour:

Tweetchat

IBM's James Kobielus has a summary of the chat, highlighting some of the key nuggets of information. For example, Dr Ramanathan revealed that this research is helping to understand interactions between genetic factors and environment that could help identify lifestyle and diet changes to manage MS. 

Also, in this interview with IBM's Mike Kearney I explain why the R language is uniquely suited for research like this, and how Revolution R works with the massive data volumes required for this study.

Thinking Inside the Box: Are you ready for big data? R is.

Multiple Sclerosis Tweet-Chat: Review

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

We had a great Twitter conversation last Thursday on the use of big-data analytics, Revolution R Enterprise, and IBM Netezza in the search for a cure for MS. Many thanks to the other panelists: Murali Ramanathan (SUNY Buffalo), Tim Coetzee (National MS Society) and moderator Shawn Dolley (IBM) for fielding and answering questions from interested parties following #IBMDataChat. As you can see from this twitteR analysis, it was a lively discussion, with more than 300 tweets during the designated hour:

Tweetchat

IBM's James Kobielus has a summary of the chat, highlighting some of the key nuggets of information. For example, Dr Ramanathan revealed that this research is helping to understand interactions between genetic factors and environment that could help identify lifestyle and diet changes to manage MS. 

Also, in this interview with IBM's Mike Kearney I explain why the R language is uniquely suited for research like this, and how Revolution R works with the massive data volumes required for this study.

Thinking Inside the Box: Are you ready for big data? R is.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

New courses from R gurus

at by David Smith on Revolutions

Looking to learn R, or to expand your R skills for data visualization or package development? Here are some R courses presented by the experts you may be interested in:

June 19-20Visualization in R with ggplot2. This course presented by Garrett Grolemund & Dr. Winston Chang of Rice University is also a web-based course with live presentation. This course provides instruction on data visualization with R, including data transformation, visualization of Big Data and polishing graphics for presentation.  

June 21-22 (in New York City) and June 28-29 (in Redwood City, CA): R Development Master Class is an in-person, in-depth R course presented R package author and professor Hadley Wickham. This two-day course offers expert instruction in R programming and package development, and is ideal for anyone looking to hone their R development skills with expert instruction.

June 26-29: An Introduction to R for SAS, SPSS, and Stata Users. This course will be presented by Bob Muenchen, author of R for SAS and SPSS users and R for Stata Users. This course will let you build on your existing statistical software skills to learn the R language -- all without leaving your desk. The course features live presentation and student/teacher interaction via a web-based course delivery system.

For more information about these courses (including pricing and registration), please follow the link below.

Revolution Analytics: Public Training Courses

New courses from R gurus

at by David Smith on R-Bloggers

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Looking to learn R, or to expand your R skills for data visualization or package development? Here are some R courses presented by the experts you may be interested in:

June 19-20Visualization in R with ggplot2. This course presented by Garrett Grolemund & Dr. Winston Chang of Rice University is also a web-based course with live presentation. This course provides instruction on data visualization with R, including data transformation, visualization of Big Data and polishing graphics for presentation.  

June 21-22 (in New York City) and June 28-29 (in Redwood City, CA): R Development Master Class is an in-person, in-depth R course presented R package author and professor Hadley Wickham. This two-day course offers expert instruction in R programming and package development, and is ideal for anyone looking to hone their R development skills with expert instruction.

June 26-29: An Introduction to R for SAS, SPSS, and Stata Users. This course will be presented by Bob Muenchen, author of R for SAS and SPSS users and R for Stata Users. This course will let you build on your existing statistical software skills to learn the R language -- all without leaving your desk. The course features live presentation and student/teacher interaction via a web-based course delivery system.

For more information about these courses (including pricing and registration), please follow the link below.

Revolution Analytics: Public Training Courses

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

New Version of RStudio (v0.96)

at by jjallaire on R-Bloggers

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Today a new version of RStudio (v0.96) is available for download from our website. The main focus of this release is improved tools for authoring, reproducible research, and web publishing. This means lots of new Sweave features as well as tight integration with the knitr package (including support for creating dynamic web reports with the new R Markdown and R HTML formats).

We’ve also added some other frequently requested editing features including code folding. Here’s a short video demo of the new authoring and web publishing features:

RStudio v0.96 Demo

We’re particularly excited about the new possibilities opened up by R Markdown, which make it easier than ever to create web content with R. On June 5th in New York we’ll talking about the latest releases of knitr and RStudio with Yihui Xie (knitr) and Jeff Horner (R/Apache and Rook):

http://www.meetup.com/nyhackr/events/64279002/

We’ll also be announcing some more new stuff at the meetup—hope to see you there!

You can download RStudio 0.96 from our website now. Here’s a list of all the new features:

Sweave / knitr

  • Spell checking for Sweave and TeX documents.
  • Integrated PDF previewer that supports two-way synchronization (SyncTeX) between the editor and PDF view.
  • Support for weaving Rnw files using the knitr package (requires knitr version 0.5 or higher).
  • Parsing of TeX error logs to extract errors, warnings, and bad boxes and present them in a navigable list.
  • Chunk option auto-complete, chunk folding, jump to chunk, and iterative execution of chunks.
  • Compilation based on multiple input files (support for specifying a root TeX document) .
  • TeX formatting commands, block comment/uncomment, and various new compilation options.

Web Publishing

  • Editing and previewing R Markdown and R HTML files (like Sweave except for web pages).
  • Creation of easy to distribute standalone HTML files (with embedded images).
  • Support for including LaTeX, ASCIIMath, and MathML equations in web pages using MathJax.

Source Editing

  • Find in files with regular expressions.
  • Code folding (expanding and collapsing regions of code).
  • Automatic comment reflowing (Cmd+Shift+/).
  • Smart editing of Roxygen comments.
  • Syntax highlighting for Markdown, HTML, Javascript, and CSS files.
  • New font customization options.

Miscellaneous

  • Fixed incompatibility with Winbind for PAM authentication.
  • Fixed editor cursor off by one line problem that occurred after rapid scrolling.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Criticism 3 of NHST: Essential Information is Lost When Transforming 2D Data into a 1D Measure

at by John Myles White on R-Bloggers

(This article was first published on John Myles White » Statistics, and kindly contributed to R-bloggers)

Introduction

Continuing on with my series on the weaknesses of NHST, I’d like to focus on an issue that’s not specific to NHST, but rather one that’s relevant to all quantitative analysis: the destruction caused by an inappropriate reduction of dimensionality. In our case, we’ll be concerned with the loss of essential information caused by the reduction of a two-dimensional world of uncertain measurements into the one-dimensional world of p-values.

p-Values Mix Up the Strength of Effects with the Precision of Their Measurement

For NHST, the two independent dimensions of measurement are (1) the strength of an effect, measured using the distance of a point estimate from zero; and (2) the uncertainty we have about the effect’s true strength, measured using something like the expected variance of our measurement device. These two dimensions are reduced into a single p-value in a way that discards much of the meaning of the original data.

When using confidence intervals, the two dimensions I’ve described are equivalent to the position of the center of the confidence interval relative to zero and the width of the confidence interval. Clearly these two dimensions can vary independently. Working through the math, it is easy to show that p-values are simply a one-dimensional representation of these two dimensions.1

To illustrate how many different kinds of data sets receive the same p-value under NHST, let’s consider three very different data sets in which we test for a difference across two groups and then get the same p-value out of our analysis:

three_studies.png

Clearly these data sets are substantively different, despite producing identical p-values. Really, we’ve seen three qualitatively different types of effects under study:

  1. An effect that is probably trivial, but which has been measured with considerable precision.
  2. An effect with moderate importance that has been measured moderately well.
  3. An effect that could be quite important, but which has been measured fairly poorly.

No one can argue that these situations are not objectively different. Importantly, I think many of us also feel that the scientific merits of these three types of research are very different: we have some use for the last two types of studies and no real use for the first. Sadly, I suspect that the scientific literature increasingly focuses on the first category, because it is always possible to measure anything precisely if you are willing to invest enough time and money. If the community’s metric for scientific quality is a p-value, which can be no more than a statement about the precision of measurements, then you will find that scientists produce precise measurements of banalities rather than tentative measurements of important effects.

How Do We Solve This Problem?

Unlike previous posts, this problem with the use of NHST can be solved without any great effort to teach people to use better methods: to compute a p-value, you need to estimate both the strength of an effect and the precision of its measurement. Moving forward, we must be certain that we report both of these quantities instead of the one-number p-value summary.

Sadly, people have been arguing for this change for years without much success. To solve our impasse, I think we need to push on our community to impose a flat out ban: going forward, researchers should only be allowed to report confidence intervals. Given that p-values can always be derived from the more informative confidence intervals while the opposite transformation is not possible, how compelling could any argument be for continuing to tolerate p-values?

References

Ziliak, S.T. and McCloskey, D.N. (2008), “The cult of statistical significance: How the standard error costs us jobs, justice, and lives”, Univ of Michigan Press

  1. Indeed, p-values are effectively constructed by dividing the distance of the point estimate from zero by the width of the confidence interval and then passing this normalized distance through a non-linear function. I’m particularly bewildered by the use of this non-linear function: most people have trouble interpreting numbers already, and this transformation seems almost designed to make the numbers harder to interpret.

To leave a comment for the author, please follow the link and comment on his blog: John Myles White » Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Example 9.31: Exploring multiple testing procedures

at by Ken Kleinman on R-Bloggers

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

In example 9.30 we explored the effects of adjusting for multiple testing using the Bonferroni and Benjamini-Hochberg (or false discovery rate, FDR) procedures. At the time we claimed that it would probably be inappropriate to extract the adjusted p-values from the FDR method from their context. In this entry we attempt to explain our misgivings about this practice.

The FDR procedure is described in Benjamini and Hochberg (JRSSB, 1995) as a "step-down" procedure. Put simply, the procedure has the following steps:

0. Choose the familywise alpha
1. Rank order the unadjusted p-values
2. Beginning with the Mth of the ordered p-values p(m),
2a. if p(m) alpha*(m/M), then reject all tests 1 ... m,
2b. if not, m = m-1
3. Repeat steps 2a and 2b until the condition is met
or p(1) > alpha/M

where M is the number of tests. The "adjusted p-value" based on this procedure is the smallest familywise alpha under which the current test would have been rejected. To calculate this, we can modify the routine above:

1. Rank order the unadjusted p-values
2. For ordered p-values p(m) M to 1,
2a. candidate ap(m) = p(m) *(M/m)
2b. if candidate ap(m) > ap(m+1) then ap(m) = ap(m+1)
2c. else ap(m) = candidate ap(m)

where ap(m) refers to the adjusted p-value corresponding to the mth ordered unadjusted p-value. It's interesting to note that the adjusted p-value for the Mth ordered test is the same as the unadjusted p-value, while the candidate adjusted p-value for the smallest test is the Bonferroni adjusted p-value. The primary difficulty with taking these p-values (as opposed to the test results) out of context is captured in steps 2b and 2c. They imply that the p-value for a given test may be lowered by other observed p-values in the family of tests. It's also true that the adjusted p-value depends on the number of tests included in the family, but this seems somewhat less troubling.

To examine the impact of the procedure on the adjusted p-values for the individual tests, we'll compare the candidate ap(m) from step 2a against the actual ap(m). Our sense is that to the degree these are different, the adjusted p-value should not be extracted from the context of the observed family of tests.


SAS
Our SAS code relies heavily on the array statement (section 1.11.5). We loop through the p-values from largest to smallest, calculating the candidate fdr p-value as above, before arriving at the final adjusted p-value. To compare the values conveniently, we make a new data set with two copies of the original data set, renaming first the candidate and then the adjusted p-values to have the same names. The in = data set option creates a temporary variable which identifies which data set an observation was read from; here it denotes which version of the same data set (and which set of p-values) was used.

data fdr;
array pvals [10] pval1 - pval10
(.001 .001 .001 .001 .001 .03 .035 .04 .05 .05);
array cfdrpvals [10] cfdr1 - cfdr10;
array fdrpvals [10] fdr1 - fdr10;
fdrpvals[10] = pvals[10];
do i = 9 to 1 by -1;
cfdrpvals[i] = pvals[i] * 10/i;
if cfdrpvals[i] > fdrpvals[i+1] then fdrpvals[i] = fdrpvals[i+1];
else fdrpvals[i] = cfdrpvals[i];
end;
run;

data compare;
set fdr (in = cfdr rename=(cfdr1=c1 cfdr2=c2 cfdr3=c3 cfdr4=c4
cfdr5=c5 cfdr6=c6 cfdr7=c7 cfdr8=c8 cfdr9=c9))
fdr (in = fdr rename=(fdr1=c1 fdr2=c2 fdr3=c3 fdr4=c4 fdr5=c5
fdr6=c6 fdr7=c7 fdr8=c8 fdr9=c9));
if cfdr then adjustment = "Candidate fdr";
if fdr then adjustment = "Final fdr";
run;

proc print data = compare; var adjustment c1-c9; run;

adjustment c1 c2 c3 c4 c5 c6 c7 c8 c9

Candidate fdr 0.010 .005 .0033 .0025 .002 .05 .05 .05 .055
Final fdr 0.002 .002 .0020 .0020 .002 .05 .05 .05 .050

(We omit the last p-value because the adjustment does not affect it.) The result shows that for many of the tests in this family, a substantially smaller p-value is obtained with the final FDR p-value than the candidate. To this degree, the FDR p-value is dependent on the observed values of the p-values in the tests in the family, and ought not to be removed from the context of these other tests. We would recommend caution in displaying the FDR p-values in such settings, given readers' propensity to use them as if they were ordinary p-values, safely adjusted for multiple testing.

R
Comparison of the R and SAS code may make SAS programmers weep. The candidate values are easily calculated, and can be presented with the final p-values in one step using the p.adjust() function. Three lines of code, albeit incorporating multiple functions in each line. (And it could sensibly be done in two, calculating the candidate p-values within the rbind() function call.) Note especially the line calculating the candidate p-values, in which vectorization allows a for loop to be avoided in a very natural fashion.

fakeps = c(rep(.2, 5), 6, 7, 8, 10, 10)/200
cfdr = fakeps * 10/(1:10)
rbind(cfdr, fdr=p.adjust(fakeps, "fdr"))[,1:9]

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
cfdr 0.010 0.005 0.0033 0.0025 0.002 0.05 0.05 0.05 0.0556 0.05
fdr 0.002 0.002 0.0020 0.0020 0.002 0.05 0.05 0.05 0.0500 0.05


An unrelated note about aggregatorsWe love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To leave a comment for the author, please follow the link and comment on his blog: SAS and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Source R-Script from Dropbox

at by Kay Cichini on R-Bloggers

(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)

A quick tip on how to source R-scripts from a Dropbox-account:

(1) Upload the script..

(2) Get link with the get link option. The link looks like "https://www.dropbox.com/s/XXXXXX/yourscript.R"..

(3) Grab this part "XXXXXX/yourscript.R" and paste it to "http://dl.dropbox.com/s/"..





(4) the final URL that can be sourced:
source("http://dl.dropbox.com/s/XXXXXX/yourscript.R")
..an example with this script stored at my Dropbox account:
source("http://dl.dropbox.com/s/c18lcwnnrodsevt/test_dropbox_source.R")


To leave a comment for the author, please follow the link and comment on his blog: theBioBucket*.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Text Mining to Word Cloud App with R

at by CloudStat on R-Bloggers

(This article was first published on CloudStat, and kindly contributed to R-bloggers)

Here is a simple application to transform text into a beautiful word cloud, Text Mining to WordCloud. The purpose is to find out the highest frequency word in a certain text. It is an app built with R language, the source code is attached at the end of the post.

For example, we want to know the words frequency of the “I have a Dream” by Martin Luther King, Jr :

And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.

I have a dream that one day this nation will rise up and live out the true meaning of its creed:

We hold these truths to be self-evident, that all men are created equal.

I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.

I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.

I have a dream today!

I have a dream that one day, down in Alabama, with its vicious racists, with its governor having his lips dripping with the words of interposition and nullification — one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers.

I have a dream today!

I have a dream that one day every valley shall be exalted, and every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight; and the glory of the Lord shall be revealed and all flesh shall see it together.

This is our hope, and this is the faith that I go back to the South with.

With this faith, we will be able to hew out of the mountain of despair a stone of hope. With this faith, we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith, we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day.

And this will be the day — this will be the day when all of God’s children will be able to sing with new meaning:

My country ‘tis of thee, sweet land of liberty, of thee I sing.
Land where my fathers died, land of the Pilgrim’s pride,
From every mountainside, let freedom ring!
And if America is to be a great nation, this must become true.
And so let freedom ring from the prodigious hilltops of New Hampshire.
Let freedom ring from the mighty mountains of New York.
Let freedom ring from the heightening Alleghenies of Pennsylvania.
Let freedom ring from the snow-capped Rockies of Colorado.
Let freedom ring from the curvaceous slopes of California.

But not only that:
Let freedom ring from Stone Mountain of Georgia.
Let freedom ring from Lookout Mountain of Tennessee.
Let freedom ring from every hill and molehill of Mississippi.
From every mountainside, let freedom ring.
And when this happens, when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God’s children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual:
Free at last! Free at last!

Thank God Almighty, we are free at last!

We just need to copy the text above (default text) into the textarea and click “Execute”.

You will get the result as below:

You can observe that “Freedom” is the most stated word!

However, there are some restrictions to use it:

  1. Double Quote mark ([b]”[/b]) is not allowed.
  2. Only Latin alphabet, i.e. English text allowed.

Source code:

A. Backend (Code)

[—hidestart—]
library(plyr)
library(stringr)
library(ggplot2)
library(doBy)
library(RColorBrewer)
require(gdata)
require(tm)
require(wordcloud)

oritext <- “[—||TEXT||—]”
[—hideend—]
plot(1)
[—hidestart—]
RemoveAtPeople <- function(text2wc) {
  gsub(“@\w+”, “”, text2wc)
}
#Then for example, remove @d names
text2wcs <- as.vector(sapply(oritext, RemoveAtPeople))
generateCorpus= function(df,my.stopwords=c()){
  text2.corpus= Corpus(VectorSource(df))
  text2.corpus = tm_map(text2.corpus, removePunctuation)
  text2.corpus = tm_map(text2.corpus, tolower)
  text2.corpus = tm_map(text2.corpus, removeWords, stopwords(“english”))
  text2.corpus = tm_map(text2.corpus, removeWords, my.stopwords)
  text2.corpus
}

pal2 <- brewer.pal(8,”Dark2”)

wordcloud.generate=function(corpus,min.freq=2){
  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc=wordcloud(d$word, d$freq, min.freq=min.freq, colors=pal2)
  wc
}
print(wordcloud.generate(generateCorpus(text2wcs,”dev8d”),2))
#
[—hideend—]

B. Fronted (Application)

<form action=”http://www.cloudstat.org/index.php?do=/kafechew/blog/text-mining-to-wordcloud-2/” method=”post” name=”cForm” id=”cForm”><input name=”phpfox[security_token]” value=”45045ddc608a78286ac27e7c7d79f672” type=”hidden”><input name=”exemode” value=”1” type=”hidden”><textarea rows=”12” cols=”145” name=”cpost[TEXT]”></textarea>
<br /><input class=”button” type=”submit” value=”Execute” name=”execute”/></form>

To leave a comment for the author, please follow the link and comment on his blog: CloudStat.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

The whinny of the exponential horse

at by Ethan Brown on R-Bloggers

(This article was first published on Statisfactions: The Sounds of Data and Whimsy » R, and kindly contributed to R-bloggers)

A Poisson process provides a good model for events that happen rarely. That's what von Bortkiewicz realized in 1898 when he modeled deaths by horse kick in Prussian cavalry; since it would be ungentlemanly to actually kill my readers, I instead represent the events in a Poisson process using a horse's whinny.

Poisson processes have a very unintuitive property: they're memoryless. (I'm here talking about a univariate, homogeneous Poisson process.) They have just one parameter, the expected waiting time between events. Let's say it's five seconds. The unintuitive part is that this doesn't ever change: even if you've waited 20 seconds, the expected waiting time is still five seconds. That's why it's called memoryless: the fact that an event has just happened gives you no information about when it's going to happen next, because it simply has no memory. It seems like after 20 seconds you'd be due for an event, but that's just not the way it works.

To tie together the math and the horse, here's a Poisson process with an expected waiting time of 5 seconds. Click below to hear the audio:

Neighification of the Exponential Horse

Here's the R code. I did a little tweak of playitbyr's shape_dotplot to make this clip, which you can easily install via install_github:

install.packages("devtools")
library(devtools)
install_github("playitbyr", username = "statisfactions", branch = "horsie")

In order to hear the sound you'll need to install and set up the csound package and Csound; some setup tips are here.

Now you can load the package, generate the Poisson process, and listen to the output.

library(playitbyhoof)
 
time <- 120 # total time (s)
expect <- 5 # expected wait time (s)
 
neigh <- rexp(1, rate = 1/expect)
while(neigh[length(neigh)] < n)
  neigh <- c(neigh, neigh[length(neigh)] + rexp(1, rate = 1/expect))
 
neighd <- data.frame(neigh = neigh)
 
sonify(neighd, sonaes(time = neigh)) +
  shape_horsie(relative = F) + scale_time_identity()

Matt Asher's Statistics Blog has a fun post on using a Poisson process to keep track of what you're doing right now by setting an expected wait time of 15 minutes. You can do this by simply changing expect, above, to 900 seconds and upping the total time of the sonification.

For more historical horsing around, my friend Maggie Boys is starting a business, Galloping Antiquities, to make rocking horses inspired by ancient artistic depictions of horses.

Image © Thowra_uk (Creative Commons BY)
Horse neigh sample © 3bagbrew (Creative Commons BY)

To leave a comment for the author, please follow the link and comment on his blog: Statisfactions: The Sounds of Data and Whimsy » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...