R - Plot many categories
I have the following data: each experiment leads to the appearance of a composition, and each composition belongs to one or many categories. I want to plot the number of occurrences of each composition:

    df <- read.table(text = "
    comp category
    comp1 1
    comp2 1
    comp3 4,2
    comp4 1,3
    comp1 1,2
    comp3 3
    ", header = TRUE)
    barplot(table(df$comp))

So far, that worked for me.
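Beyond that, each composition belongs to one or many categories, and the categories are separated by commas. I want a barplot with the compositions on the x axis, the number of occurrences on the y axis, and each bar split according to the percentage of each category.
My idea is to duplicate every line that contains a comma, repeating it n + 1 times, where n is the number of commas.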
    df = table(df$category, df$comp)
    cats <- strsplit(rownames(df), ",", fixed = TRUE)
    df <- df[rep(seq_len(nrow(df)), sapply(cats, length)), ]
    df <- as.data.frame(unclass(df))
    df$cat <- unlist(cats)
    df <- aggregate(. ~ cat, df, FUN = sum)

For comp1, for example, it gives me:

          1 2 3 4
    comp1 2 1 0 0

But if I apply this method, the total over the categories (3) no longer corresponds to the number of occurrences of the composition (comp1 = 2).
How should I proceed in such a case? Is the solution to divide by the number of commas + 1? If yes, how do I do that in code, and is there a simpler way?
Thanks a lot!
Producing the plot requires two steps, as you have already noticed. First, one needs to prepare the data; then one can create the plot.
Preparing the data
You have already shown your efforts to bring the data into a suitable form, but let me propose an alternative way.
First, I have to make sure that the category column of the data frame is character and not factor. I also store a vector of all the categories that appear in the data frame:
    df$category <- as.character(df$category)
    cats <- unique(unlist(strsplit(df$category, ",")))

Next, I need to summarise the data. For this purpose, I need a function that gives, for each value of comp, the share of each category, scaled such that the values sum to the number of rows that this comp has in the original data.
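The following function returns this information for an entire data frame, in the form of a data frame (the output needs to be a data frame, because I want to use the function with do() later).

    cat_perc <- function(cats, vec) {
      # Number of entries of vec that mention each category.
      # Note: grepl() does pattern matching, so this assumes that no category
      # label is contained in another one (e.g. "1" and "11").
      nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
      # Share of each category among all mentions
      perc <- nums / sum(nums)
      # Rescale so that the values sum to the number of entries of vec
      final <- perc * length(vec)
      df <- as.data.frame(as.list(final))
      names(df) <- cats
      return(df)
    }

Running the function on the complete data frame gives: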
    cat_perc(cats, df$category)
    ##          1         4        2        3
    ## 1 2.666667 0.6666667 1.333333 1.333333

The values sum to six, which is indeed the total number of rows in the original data frame.
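To see where these numbers come from, the arithmetic inside cat_perc can be traced by hand. The following is only a minimal sketch: it rebuilds the category column from the question as a plain vector, and the names `category`, `nums` and `scaled` are mine, chosen for illustration.

```r
# The category column from the question, as a plain character vector
category <- c("1", "1", "4,2", "1,3", "1,2", "3")

# All distinct categories, in order of first appearance
cats <- unique(unlist(strsplit(category, ",")))

# Number of rows in which each category is mentioned
nums <- sapply(cats, function(cat) sum(grepl(cat, category)))

# Share of each category among all mentions, rescaled to the number of rows
scaled <- nums / sum(nums) * length(category)

print(nums)         # mentions per category
print(sum(scaled))  # equals length(category)
```

There are nine category mentions spread over six rows, so each mention contributes 6/9 to the result, and the four mentions of category 1 give 4 * 6/9 = 2.666667.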
Now I want to run this function for each value of comp, which can be done using the dplyr package:
    library(dplyr)
    plot_data <- group_by(df, comp) %>%
      do(cat_perc(cats, .$category))
    plot_data
    ## Source: local data frame [4 x 5]
    ## Groups: comp [4]
    ##
    ##     comp        1         4         2         3
    ##   (fctr)    (dbl)     (dbl)     (dbl)     (dbl)
    ## 1  comp1 1.333333 0.0000000 0.6666667 0.0000000
    ## 2  comp2 1.000000 0.0000000 0.0000000 0.0000000
    ## 3  comp3 0.000000 0.6666667 0.6666667 0.6666667
    ## 4  comp4 0.500000 0.0000000 0.0000000 0.5000000

This first groups the data by comp and then applies the function cat_perc to the subset of the data frame that belongs to each comp.
I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to one row in the data frame. (As it is now, each row contains four data points.) This can be done with the tidyr package as follows:
    library(tidyr)
    plot_data <- gather(plot_data, category, value, -comp)
    head(plot_data)
    ## Source: local data frame [6 x 3]
    ## Groups: comp [4]
    ##
    ##     comp category    value
    ##   (fctr)    (chr)    (dbl)
    ## 1  comp1        1 1.333333
    ## 2  comp2        1 1.000000
    ## 3  comp3        1 0.000000
    ## 4  comp4        1 0.500000
    ## 5  comp1        4 0.000000
    ## 6  comp2        4 0.000000

As you can see, there is now a single data point per row, characterised by comp, category and the corresponding value.
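If it helps to see what the reshaping into long format does, here is a rough base R sketch of the same idea. It does not use tidyr. The data frame `wide` is typed in by hand from the values shown earlier, and the names `wide`, `cat_cols` and `long` are my own, for illustration only.

```r
# Wide table: one row per comp, one column per category (values from above)
wide <- data.frame(
  comp = c("comp1", "comp2", "comp3", "comp4"),
  `1`  = c(4/3, 1, 0, 0.5),
  `4`  = c(0, 0, 2/3, 0),
  `2`  = c(2/3, 0, 2/3, 0),
  `3`  = c(0, 0, 2/3, 0.5),
  check.names = FALSE
)

# Every column except comp holds values for one category
cat_cols <- setdiff(names(wide), "comp")

# Long table: one row per (comp, category) pair, stacked column by column
long <- data.frame(
  comp     = rep(wide$comp, times = length(cat_cols)),
  category = rep(cat_cols, each = nrow(wide)),
  value    = unlist(wide[cat_cols], use.names = FALSE)
)

print(head(long))
```

The four rows with four values each become sixteen rows with one value each. Newer versions of tidyr also provide pivot_longer() for this kind of reshaping.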
Plotting the data
Now that everything is ready, we can plot the data using ggplot:
    library(ggplot2)
    ggplot(plot_data, aes(x = comp, y = value, fill = category)) +
      geom_bar(stat = "identity")
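As for the idea from the question of dividing by the number of commas + 1: this can be sketched in base R as well. Note that it is a different weighting from cat_perc above, not a rewrite of it. Here each row contributes a total weight of 1, split equally among its categories, so comp1 gets 1.5 for category 1 instead of 1.333333. The names `cats_per_row`, `n_cats`, `long` and `weighted` are assumptions of mine, for illustration.

```r
df <- read.table(text = "
comp category
comp1 1
comp2 1
comp3 4,2
comp4 1,3
comp1 1,2
comp3 3
", header = TRUE, stringsAsFactors = FALSE)

# Split each row's categories; a row with n commas yields n + 1 entries
cats_per_row <- strsplit(as.character(df$category), ",", fixed = TRUE)
n_cats <- lengths(cats_per_row)

# One line per (row, category) pair, each carrying a weight of 1 / (n + 1)
long <- data.frame(
  comp     = rep(df$comp, times = n_cats),
  category = unlist(cats_per_row),
  weight   = rep(1 / n_cats, times = n_cats)
)

# Sum the weights: rows are categories, columns are compositions
weighted <- xtabs(weight ~ category + comp, data = long)

# Column totals equal the plain occurrence counts of each composition
print(colSums(weighted))

# Stacked bars: one bar per composition, one segment per category
barplot(weighted, legend.text = TRUE, xlab = "comp", ylab = "count")
```

With this weighting the height of each bar is the number of occurrences of the composition, which is what barplot(table(df$comp)) showed in the question.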