R - Plot many categories
I have the following data. Each experiment leads to the appearance of a composition, and each composition belongs to one or many categories. I want to plot the number of occurrences of each composition:
    df <- read.table(text = "
    comp category
    comp1 1
    comp2 1
    comp3 4,2
    comp4 1,3
    comp1 1,2
    comp3 3
    ", header = TRUE)
    barplot(table(df$comp))
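So far, that works for me.
Beyond that, each composition belongs to one or many categories, and the categories are separated by commas. I want a barplot with the compositions on the x axis and the number of occurrences on the y axis, where each bar is split by the percentage of each category.
My idea is to duplicate every line that contains a comma, repeating it (number of commas + 1) times.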
    df = table(df$category, df$comp)
    cats <- strsplit(rownames(df), ",", fixed = TRUE)
    df <- df[rep(seq_len(nrow(df)), sapply(cats, length)), ]
    df <- as.data.frame(unclass(df))
    df$cat <- unlist(cats)
    df <- aggregate(. ~ cat, df, FUN = sum)
This gives me, for example, for comp1:

          1 2 3 4
    comp1 2 1 0 0
But if I apply this method, the total number of categories (3) does not correspond to the total number of compositions (comp1 = 2).
How should I proceed in such a case? Is the solution to divide by the number of commas + 1? If yes, how do I do that in code, and is there a simpler way?
Thanks a lot!
Producing the plot requires two steps, as you have noticed. First, one needs to prepare the data; then one can create the plot.
Preparing the data
You have already shown your efforts at bringing the data into a suitable form, but let me propose an alternative way.
First, I have to make sure that the category column of the data frame is character and not factor. I also store a vector of all the categories that appear in the data frame:
    df$category <- as.character(df$category)
    cats <- unique(unlist(strsplit(df$category, ",")))
Next, I need to summarise the data. For this purpose, I need a function that gives, for each value in comp, the percentage for each category, scaled such that the sum of the values gives the number of rows in the original data for that comp.
The following function returns this information for the entire data frame in the form of a data frame (the output needs to be a data frame, because I want to use the function with do() later).
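    cat_perc <- function(cats, vec) {
      # number of rows in which each category appears
      # (grepl() matches substrings, which is fine for single-digit labels)
      nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
      # percentages
      perc <- nums/sum(nums)
      # scale so that the values sum to the number of rows
      final <- perc * length(vec)
      df <- as.data.frame(as.list(final))
      names(df) <- cats
      return(df)
    }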
Running the function on the complete data frame gives:
    cat_perc(cats, df$category)
    ##          1         4        2        3
    ## 1 2.666667 0.6666667 1.333333 1.333333
The values sum to six, which is indeed the total number of rows in the original data frame.
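One caveat with the function above: grepl() does substring matching, so a category named 1 would also match a cell such as 10 or 11,2. That does not matter for the sample data, but a stricter variant could split each cell on commas and compare whole labels. The following is a minimal sketch under that assumption; cat_perc_exact is a hypothetical name and not part of the original answer, and df is rebuilt here only so that the block runs on its own.

```r
# Sample data from the question, with categories kept as character strings.
df <- data.frame(
  comp = c("comp1", "comp2", "comp3", "comp4", "comp1", "comp3"),
  category = c("1", "1", "4,2", "1,3", "1,2", "3"),
  stringsAsFactors = FALSE
)
cats <- unique(unlist(strsplit(df$category, ",", fixed = TRUE)))

# Hypothetical stricter variant of cat_perc(): compare whole labels
# instead of using grepl(), which would also match "1" inside "10".
cat_perc_exact <- function(cats, vec) {
  parts <- strsplit(vec, ",", fixed = TRUE)
  # Number of rows whose category list contains each label exactly.
  nums <- sapply(cats, function(lab) {
    sum(sapply(parts, function(p) lab %in% p))
  })
  # Rescale so that the values sum to the number of rows.
  final <- nums / sum(nums) * length(vec)
  out <- as.data.frame(as.list(final))
  names(out) <- cats
  out
}

res <- cat_perc_exact(cats, df$category)
print(res)
```

On the sample data this should give the same values as cat_perc(), since every label there is a single digit.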
Now I want to run that function for each value of comp, which can be done using the dplyr package:
    library(dplyr)
    plot_data <- group_by(df, comp) %>% do(cat_perc(cats, .$category))
    plot_data
    ## Source: local data frame [4 x 5]
    ## Groups: comp [4]
    ##
    ##     comp        1         4         2         3
    ##   (fctr)    (dbl)     (dbl)     (dbl)     (dbl)
    ## 1  comp1 1.333333 0.0000000 0.6666667 0.0000000
    ## 2  comp2 1.000000 0.0000000 0.0000000 0.0000000
    ## 3  comp3 0.000000 0.6666667 0.6666667 0.6666667
    ## 4  comp4 0.500000 0.0000000 0.0000000 0.5000000
This first groups the data by comp and then applies the function cat_perc to the subset of the data frame belonging to each given comp.
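To make the group-then-apply step more concrete, roughly the same table can be built in base R with split() and lapply(). This is only a sketch of an equivalent and not the method used in the answer; it redefines df, cats and cat_perc so that it runs on its own, and the names pieces and wide are made up for illustration.

```r
# Sample data from the question, with categories kept as character strings.
df <- data.frame(
  comp = c("comp1", "comp2", "comp3", "comp4", "comp1", "comp3"),
  category = c("1", "1", "4,2", "1,3", "1,2", "3"),
  stringsAsFactors = FALSE
)
cats <- unique(unlist(strsplit(df$category, ",")))

# Same logic as cat_perc() in the answer, written compactly.
cat_perc <- function(cats, vec) {
  nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
  final <- nums / sum(nums) * length(vec)
  out <- as.data.frame(as.list(final))
  names(out) <- cats
  out
}

# Split the category column by composition, apply cat_perc() to each
# piece, and stack the one-row results into a single data frame.
pieces <- lapply(split(df$category, df$comp), function(v) cat_perc(cats, v))
wide <- do.call(rbind, pieces)

# Move the group labels from the row names into a regular column.
wide <- cbind(comp = rownames(wide), wide)
rownames(wide) <- NULL
print(wide)
```

Each row of wide should sum to the number of rows that the corresponding comp has in df.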
I will plot the data with the ggplot2 package, which requires the data to be in so-called long format. This means that each data point to be plotted should correspond to one row in the data frame. (As it is now, each row contains four data points.) This can be done with the tidyr package as follows:
    library(tidyr)
    plot_data <- gather(plot_data, category, value, -comp)
    head(plot_data)
    ## Source: local data frame [6 x 3]
    ## Groups: comp [4]
    ##
    ##     comp category    value
    ##   (fctr)    (chr)    (dbl)
    ## 1  comp1        1 1.333333
    ## 2  comp2        1 1.000000
    ## 3  comp3        1 0.000000
    ## 4  comp4        1 0.500000
    ## 5  comp1        4 0.000000
    ## 6  comp2        4 0.000000
As you can see, there is now a single data point per row, characterised by comp, category, and the corresponding value.
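The difference between the wide and the long layout can also be illustrated without tidyr. The sketch below rebuilds the wide table from the values printed earlier (written as fractions) and reshapes it by hand in base R; the names wide and long are chosen for illustration only.

```r
# Wide table as printed in the answer: one row per comp,
# one column per category.
wide <- data.frame(
  comp = c("comp1", "comp2", "comp3", "comp4"),
  "1" = c(4 / 3, 1, 0, 0.5),
  "4" = c(0, 0, 2 / 3, 0),
  "2" = c(2 / 3, 0, 2 / 3, 0),
  "3" = c(0, 0, 2 / 3, 0.5),
  check.names = FALSE
)

cats <- setdiff(names(wide), "comp")

# Long format: one row per (comp, category) pair. The columns of
# `wide` are concatenated one after another, so comp is repeated as a
# whole and each category label is repeated once per row of `wide`.
long <- data.frame(
  comp = rep(wide$comp, times = length(cats)),
  category = rep(cats, each = nrow(wide)),
  value = unlist(wide[cats], use.names = FALSE)
)
print(head(long))
```

The 4 x 4 block of values becomes 16 rows, and the total of the value column is unchanged.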
Plotting the data
Now that everything is ready, you can plot the data using ggplot:
    library(ggplot2)
    ggplot(plot_data, aes(x = comp, y = value, fill = category)) +
      geom_bar(stat = "identity")
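The question also asked whether one could simply divide by the number of commas + 1. That is a different weighting from cat_perc() (a row with k categories contributes 1/k to each of them), so the segment sizes can differ from the ones above, but the bar heights still equal the number of rows per comp. A minimal base R sketch of that idea, with parts, k, long and counts as illustrative names:

```r
# Sample data from the question, with categories kept as character strings.
df <- data.frame(
  comp = c("comp1", "comp2", "comp3", "comp4", "comp1", "comp3"),
  category = c("1", "1", "4,2", "1,3", "1,2", "3"),
  stringsAsFactors = FALSE
)

# Split each category cell; a row with k categories contributes 1/k
# to each of them (k is the number of commas + 1).
parts <- strsplit(df$category, ",", fixed = TRUE)
k <- lengths(parts)
long <- data.frame(
  comp = rep(df$comp, times = k),
  category = unlist(parts),
  weight = rep(1 / k, times = k)
)

# Weighted contingency table: categories in rows, compositions in columns.
counts <- xtabs(weight ~ category + comp, data = long)
print(counts)

# Stacked bars; each bar's height equals the number of rows for that comp.
barplot(counts, legend.text = TRUE, xlab = "comp", ylab = "count")
```

For comp1, for example, this gives 1.5 for category 1 and 0.5 for category 2, whereas cat_perc() gives 1.33 and 0.67; which weighting is appropriate depends on how a multi-category row should be counted.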