The Purugganan lab at NYU, in collaboration with Oxford Nanopore Technologies and the New York Genome Center, has sequenced and assembled the genome of basmati rice, an iconic variety grown in the Indian subcontinent. The paper was published in Genome Biology, and the accompanying data were deposited in Zenodo. We will explore some of the basic assembly statistics and compare them to the values reported in the paper.
Please download the genome file we will be working with today from here.
We will be working with a polished but unscaffolded version of the genome assembly of the Pakistani variety Basmati 334. Genome assemblies are usually stored in FASTA format. Here is what it looks like:
>header_1
ATCGATCTAGCGATCGAGCTATATATATCCCGCGTAG
>header_2
TAGCGATAGCGGGCATCGATTCAACGCTAGCTGATGC
Note: sequences (but not headers) may be split across multiple lines, but this is not the case in the file we will be working with. In our file, each header and each sequence is a single line of text.
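If you ever meet a FASTA file whose sequences are wrapped over several lines, the lines can be collapsed into the single-line layout used here. The following is only a minimal sketch on made-up data: the vector `fasta_lines` and its contents are hypothetical, and the approach assumes that every header is followed by at least one sequence line.

```r
# Toy FASTA content with wrapped sequences (hypothetical data, not the Basmati file)
fasta_lines <- c(">header_1", "ATCGATCT", "AGCGATCG",
                 ">header_2", "TAGCGATA", "GCGGGCAT", "CGA")

# flag the header lines
is_header <- startsWith(fasta_lines, ">")

# cumsum() over the flags gives every line the number of the record it belongs to
record_id <- cumsum(is_header)

headers <- fasta_lines[is_header]

# paste together the sequence lines of each record
seqs <- tapply(fasta_lines[!is_header], record_id[!is_header],
               paste, collapse = "")
seqs <- as.vector(seqs)

# interleave: header, sequence, header, sequence, ...
single_line <- as.vector(rbind(headers, seqs))
single_line
```

After this step the odd elements are headers and the even elements are complete sequences, which is the layout the rest of this exercise relies on.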
In a perfect genome assembly (e.g. that of C. elegans), the number of sequences will equal the number of chromosomes. However, chromosome-level assemblies are still somewhat rare, and genomes are usually assembled as a larger number of disconnected fragments. These fragments are called “contigs” because they represent contiguous assemblies of shorter sequences.
Let us import the Basmati genome file and examine the number of contigs. FASTA files are text files, but they are not really organized as tables. Therefore, we will be using the function readLines() instead of the more familiar read.table() or read.csv() (although you may be able to use those, too). readLines() reads a text file line by line and returns a character vector where each element is a line from the file.
genome <- readLines("Basmati334.basmati.not_scaffolded_singleline.fa")
str(genome)
## chr [1:376] ">contig_1" ...
# use substr() to only display the first 10 characters of each element, because your laptop will likely freeze otherwise (most sequences are very long)
substr(head(genome),
start = 1,
stop = 10)
## [1] ">contig_1" "AATTTTAGTT" ">contig_2" "GAGGGGAAGG" ">contig_3"
## [6] "CACTCCAAAC"
Let us create a data.frame with contig names in the first column and contig lengths in the second column. First, extract the contig names.
# contig names are the odd elements of the genome vector
# how would you extract all the odd elements?
contig_names <- genome[c(TRUE,FALSE)] # but there are other ways, too
head(contig_names)
## [1] ">contig_1" ">contig_2" ">contig_3" ">contig_4" ">contig_5" ">contig_6"
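As a sketch of the "other ways" mentioned in the comment above, here are three equivalent selections on a small made-up vector (`x` is hypothetical and stands in for `genome`). The first two rely on position, the third on content.

```r
# toy stand-in for the genome vector (hypothetical data)
x <- c(">contig_1", "AATT", ">contig_2", "GAGG", ">contig_3", "CACT")

# 1. logical recycling: TRUE, FALSE, TRUE, FALSE, ...
by_recycling <- x[c(TRUE, FALSE)]

# 2. explicit odd indices: 1, 3, 5, ...
by_index <- x[seq(1, length(x), by = 2)]

# 3. select by content: keep the elements that start with ">"
by_pattern <- x[startsWith(x, ">")]

identical(by_recycling, by_index)    # the two positional ways agree
identical(by_recycling, by_pattern)  # and so does the content-based way
```

The content-based version does not assume that headers and sequences strictly alternate, so it would also work on a file with wrapped sequences.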
# get rid of ">" by splitting each element into the "before >" and "after >" part
contig_names_split <- strsplit(contig_names,
split = ">") # split each element by ">"
head(contig_names_split)
## [[1]]
## [1] "" "contig_1"
##
## [[2]]
## [1] "" "contig_2"
##
## [[3]]
## [1] "" "contig_3"
##
## [[4]]
## [1] "" "contig_4"
##
## [[5]]
## [1] "" "contig_5"
##
## [[6]]
## [1] "" "contig_6"
class(contig_names_split)
## [1] "list"
contig_names_split_unlisted <- unlist(contig_names_split) # convert list to vector
head(contig_names_split_unlisted)
## [1] "" "contig_1" "" "contig_2" "" "contig_3"
class(contig_names_split_unlisted)
## [1] "character"
contig_names_split_unlisted_cleaned <- contig_names_split_unlisted[c(FALSE,TRUE)] # only keep even elements
head(contig_names_split_unlisted_cleaned)
## [1] "contig_1" "contig_2" "contig_3" "contig_4" "contig_5" "contig_6"
# a simpler way - get rid of the first character in each element (since ">" is always just 1 character)
contig_names_split_unlisted_cleaned <- substring(contig_names,2)
head(contig_names_split_unlisted_cleaned)
## [1] "contig_1" "contig_2" "contig_3" "contig_4" "contig_5" "contig_6"
Now, extract contig sequences and calculate their length.
# sequences are the even elements of the genome vector
sequences <- genome[c(FALSE,TRUE)]
# calculate the length (the number of characters) of each sequence
sequences_length <- nchar(sequences)
head(sequences_length)
## [1] 498296 9857291 67462 8941 199118 192537
Create a data.frame with contig names in the first column and contig lengths in the second column.
basmati <- data.frame("contig_name" = contig_names_split_unlisted_cleaned,
"contig_length" = sequences_length)
head(basmati)
## contig_name contig_length
## 1 contig_1 498296
## 2 contig_2 9857291
## 3 contig_3 67462
## 4 contig_4 8941
## 5 contig_5 199118
## 6 contig_6 192537
How are the contig lengths distributed? Plot a histogram and a horizontal distribution of individual data points in one plot using ggplot2.
library(ggplot2) # load the plotting package if it is not loaded yet
ggplot(data = basmati,
mapping = aes(x = contig_length)) +
geom_histogram() +
geom_jitter(mapping = aes(y = -20),
height = 10,
width = 0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Apart from the Y axis showing counts rather than densities, this is essentially an empirical probability density function (ePDF).
Let us calculate “by hand” an empirical cumulative distribution function (eCDF) of the contig lengths and plot it. The X axis will be the same as above. The main difference is that the Y axis must now contain the cumulative fraction of contigs as we move left to right (i.e. from the smallest to the largest contig length).
# sort contigs by length and add a column containing the cumulative fraction of contigs (in our case 1/188, 2/188, 3/188, etc.) using dplyr
library(dplyr) # provides %>%, arrange() and mutate()
basmati_ecdf <-
basmati %>%
arrange(contig_length) %>%
mutate(cumulative_fraction_contigs = (1:n()) / n() )
# plot using geom_point
ggplot(data = basmati_ecdf,
mapping = aes(x = contig_length,
y = cumulative_fraction_contigs)) +
geom_point()
# compare to the built-in function ecdf() to make sure that we got it right
plot(ecdf(basmati_ecdf$contig_length))
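The comparison with ecdf() can also be made numerically rather than by eye. Below is a minimal sketch on made-up lengths (`lengths_toy` is hypothetical): ecdf() returns a function, which we evaluate at the sorted values and compare with the hand-made fractions.

```r
# toy contig lengths, deliberately unsorted (hypothetical data)
lengths_toy <- c(500, 20, 3000, 150)

sorted_lengths <- sort(lengths_toy)
n <- length(sorted_lengths)

# "by hand": 1/n, 2/n, ..., n/n
manual_ecdf <- (1:n) / n

# built-in: ecdf() returns a function that we evaluate at the sorted lengths
builtin_ecdf <- ecdf(lengths_toy)(sorted_lengths)

all.equal(manual_ecdf, builtin_ecdf)
```

The two agree exactly only when all lengths are distinct. With tied lengths, ecdf() assigns every tied value the same (highest) cumulative fraction, whereas the by-hand column keeps counting up.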
How long are the top 50% of the contigs? (The desired answer is something like “longer than XXX kb”). First, draw a horizontal line on the plot at 50% of contigs and determine by eye approximately where it intersects the eCDF. Then, look at the table we generated and determine the exact number. Plot this number as a vertical line.
# copy the ggplot code from the previous chunk
# and add a horizontal line drawn at 50%
ggplot(data = basmati_ecdf,
mapping = aes(x = contig_length,
y = cumulative_fraction_contigs)) +
geom_point() +
geom_hline(yintercept = 0.5)
# print basmati_ecdf and determine which contig is the shortest one in the top 50%
basmati_ecdf
## contig_name contig_length cumulative_fraction_contigs
## 1 contig_81 9 0.005319149
## 2 contig_24 7433 0.010638298
## 3 contig_153 7528 0.015957447
## 4 contig_42 8105 0.021276596
## 5 contig_4 8941 0.026595745
## 6 contig_101 9480 0.031914894
## 7 contig_25 9952 0.037234043
## 8 contig_20 11700 0.042553191
## 9 contig_62 12032 0.047872340
## 10 contig_183 14212 0.053191489
## 11 contig_181 15949 0.058510638
## 12 contig_53 16427 0.063829787
## 13 contig_145 18745 0.069148936
## 14 contig_36 24859 0.074468085
## 15 contig_47 29542 0.079787234
## 16 contig_27 31210 0.085106383
## 17 contig_39 32749 0.090425532
## 18 contig_17 33362 0.095744681
## 19 contig_35 33483 0.101063830
## 20 contig_169 37647 0.106382979
## 21 contig_41 40016 0.111702128
## 22 contig_49 41336 0.117021277
## 23 contig_16 41376 0.122340426
## 24 contig_51 44177 0.127659574
## 25 contig_40 44786 0.132978723
## 26 contig_54 44788 0.138297872
## 27 contig_13 45535 0.143617021
## 28 contig_21 48692 0.148936170
## 29 contig_11 49007 0.154255319
## 30 contig_43 52809 0.159574468
## 31 contig_22 55304 0.164893617
## 32 contig_33 56366 0.170212766
## 33 contig_23 61183 0.175531915
## 34 contig_8 63575 0.180851064
## 35 contig_18 64234 0.186170213
## 36 contig_29 67282 0.191489362
## 37 contig_3 67462 0.196808511
## 38 contig_156 69065 0.202127660
## 39 contig_28 71871 0.207446809
## 40 contig_26 72890 0.212765957
## 41 contig_50 75157 0.218085106
## 42 contig_186 76681 0.223404255
## 43 contig_38 77801 0.228723404
## 44 contig_10 81664 0.234042553
## 45 contig_187 83191 0.239361702
## 46 contig_46 87276 0.244680851
## 47 contig_9 91244 0.250000000
## 48 contig_12 100238 0.255319149
## 49 contig_175 100570 0.260638298
## 50 contig_173 102007 0.265957447
## 51 contig_30 104186 0.271276596
## 52 contig_180 107649 0.276595745
## 53 contig_32 108243 0.281914894
## 54 contig_48 109033 0.287234043
## 55 contig_149 110428 0.292553191
## 56 contig_37 112909 0.297872340
## 57 contig_34 113587 0.303191489
## 58 contig_160 113942 0.308510638
## 59 contig_45 119599 0.313829787
## 60 contig_143 120182 0.319148936
## 61 contig_178 143039 0.324468085
## 62 contig_97 145297 0.329787234
## 63 contig_7 146029 0.335106383
## 64 contig_72 148012 0.340425532
## 65 contig_19 149136 0.345744681
## 66 contig_182 154743 0.351063830
## 67 contig_171 155222 0.356382979
## 68 contig_188 166611 0.361702128
## 69 contig_52 172586 0.367021277
## 70 contig_177 178566 0.372340426
## 71 contig_172 187082 0.377659574
## 72 contig_6 192537 0.382978723
## 73 contig_5 199118 0.388297872
## 74 contig_179 213575 0.393617021
## 75 contig_184 222827 0.398936170
## 76 contig_146 223805 0.404255319
## 77 contig_127 242673 0.409574468
## 78 contig_168 245319 0.414893617
## 79 contig_161 252103 0.420212766
## 80 contig_107 266072 0.425531915
## 81 contig_158 271912 0.430851064
## 82 contig_121 274679 0.436170213
## 83 contig_155 278187 0.441489362
## 84 contig_123 284254 0.446808511
## 85 contig_151 322212 0.452127660
## 86 contig_170 330676 0.457446809
## 87 contig_185 348122 0.462765957
## 88 contig_162 361942 0.468085106
## 89 contig_68 380643 0.473404255
## 90 contig_148 395160 0.478723404
## 91 contig_159 421338 0.484042553
## 92 contig_150 430447 0.489361702
## 93 contig_174 441564 0.494680851
## 94 contig_1 498296 0.500000000
## 95 contig_129 539136 0.505319149
## 96 contig_142 539438 0.510638298
## 97 contig_76 555924 0.515957447
## 98 contig_110 596148 0.521276596
## 99 contig_163 600324 0.526595745
## 100 contig_109 614511 0.531914894
## 101 contig_84 624327 0.537234043
## 102 contig_133 627997 0.542553191
## 103 contig_114 647674 0.547872340
## 104 contig_92 653668 0.553191489
## 105 contig_147 668131 0.558510638
## 106 contig_157 718679 0.563829787
## 107 contig_82 720367 0.569148936
## 108 contig_58 757613 0.574468085
## 109 contig_80 823800 0.579787234
## 110 contig_57 866722 0.585106383
## 111 contig_86 890409 0.590425532
## 112 contig_132 918977 0.595744681
## 113 contig_134 919952 0.601063830
## 114 contig_140 1008158 0.606382979
## 115 contig_78 1033031 0.611702128
## 116 contig_73 1044897 0.617021277
## 117 contig_66 1137797 0.622340426
## 118 contig_116 1191750 0.627659574
## 119 contig_126 1229814 0.632978723
## 120 contig_167 1231404 0.638297872
## 121 contig_137 1311625 0.643617021
## 122 contig_96 1312965 0.648936170
## 123 contig_115 1356278 0.654255319
## 124 contig_95 1466788 0.659574468
## 125 contig_112 1532608 0.664893617
## 126 contig_100 1548493 0.670212766
## 127 contig_164 1585953 0.675531915
## 128 contig_108 1589164 0.680851064
## 129 contig_89 1767502 0.686170213
## 130 contig_79 1968141 0.691489362
## 131 contig_64 1991304 0.696808511
## 132 contig_144 2068535 0.702127660
## 133 contig_119 2114382 0.707446809
## 134 contig_15 2192592 0.712765957
## 135 contig_103 2205000 0.718085106
## 136 contig_75 2214126 0.723404255
## 137 contig_94 2597398 0.728723404
## 138 contig_166 2650669 0.734042553
## 139 contig_99 2734327 0.739361702
## 140 contig_117 2742382 0.744680851
## 141 contig_111 3007642 0.750000000
## 142 contig_105 3073012 0.755319149
## 143 contig_56 3199752 0.760638298
## 144 contig_87 3410783 0.765957447
## 145 contig_63 3414174 0.771276596
## 146 contig_93 3482043 0.776595745
## 147 contig_102 3559400 0.781914894
## 148 contig_136 3648715 0.787234043
## 149 contig_113 3690720 0.792553191
## 150 contig_91 3772550 0.797872340
## 151 contig_139 3796564 0.803191489
## 152 contig_165 3839796 0.808510638
## 153 contig_154 3905687 0.813829787
## 154 contig_125 3909978 0.819148936
## 155 contig_104 3928836 0.824468085
## 156 contig_88 4178305 0.829787234
## 157 contig_120 4327600 0.835106383
## 158 contig_135 4340191 0.840425532
## 159 contig_44 4405158 0.845744681
## 160 contig_131 4605396 0.851063830
## 161 contig_122 4871081 0.856382979
## 162 contig_118 5145525 0.861702128
## 163 contig_141 5267630 0.867021277
## 164 contig_106 5343470 0.872340426
## 165 contig_60 5367701 0.877659574
## 166 contig_65 5479328 0.882978723
## 167 contig_138 5718025 0.888297872
## 168 contig_176 6112879 0.893617021
## 169 contig_67 6316586 0.898936170
## 170 contig_130 6344805 0.904255319
## 171 contig_77 6483238 0.909574468
## 172 contig_128 6953941 0.914893617
## 173 contig_85 7478188 0.920212766
## 174 contig_69 7969657 0.925531915
## 175 contig_70 8515110 0.930851064
## 176 contig_152 8752696 0.936170213
## 177 contig_14 8941791 0.941489362
## 178 contig_71 8990282 0.946808511
## 179 contig_59 9018131 0.952127660
## 180 contig_55 9647745 0.957446809
## 181 contig_2 9857291 0.962765957
## 182 contig_124 10203661 0.968085106
## 183 contig_61 10722564 0.973404255
## 184 contig_31 11117648 0.978723404
## 185 contig_83 12223136 0.984042553
## 186 contig_98 12370970 0.989361702
## 187 contig_74 16390624 0.994680851
## 188 contig_90 17040366 1.000000000
# does this relate to any distribution metric that you are familiar with?
median(basmati_ecdf$contig_length)
## [1] 518716
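Note that 518716 does not appear anywhere in the table. With an even number of contigs (188), median() returns the mean of the two middle values, here rows 94 and 95 of the sorted table. A quick check, using the two lengths copied from the output above:

```r
# lengths in rows 94 and 95 of the sorted table (contig_1 and contig_129)
middle_two <- c(498296, 539136)

# for an even number of observations, the median is the mean of the two middle values
mean(middle_two)
```

This should print 518716, the same value that median() returned.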
# copy the ggplot code above
# and add a vertical line drawn at the value you just determined
ggplot(data = basmati_ecdf,
mapping = aes(x = contig_length,
y = cumulative_fraction_contigs)) +
geom_point() +
geom_hline(yintercept = 0.5) +
geom_vline(xintercept = median(basmati_ecdf$contig_length))
In genome biology, the most common way to report the contiguity of an assembly is not the median contig length, but the N statistics, e.g. N50, N90 etc. According to Wikipedia, “given a set of contigs, the N50 is defined as the sequence length of the shortest contig at 50% of the total genome length.” Do not worry if this sounds somewhat confusing. It will become much clearer once we visualize it below.
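To make the definition concrete, here is a small worked example on six made-up contigs (`toy_lengths` is hypothetical). Sorting from longest to shortest and accumulating lengths, the N50 is the length of the contig at which the running total first reaches half of the assembly.

```r
# six toy contigs, total length 290 (hypothetical data)
toy_lengths <- c(80, 70, 50, 40, 30, 20)

# sort from longest to shortest
sorted_desc <- sort(toy_lengths, decreasing = TRUE)

# running total as a fraction of the assembly: 80/290, 150/290, 200/290, ...
cum_frac <- cumsum(sorted_desc) / sum(sorted_desc)

# the first contig at which the running total reaches 50% of all bases
n50 <- sorted_desc[which(cum_frac >= 0.5)[1]]
n50

# for comparison, the median contig length
median(toy_lengths)
```

Here the two longest contigs (80 + 70 = 150) already cover more than half of the 290 bases, so the N50 is 70, while the median contig length is 45.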
In fact, the idea of N statistics is inspired by the concept of eCDF. The main difference is that instead of a cumulative fraction of contigs (in our case, 1/188, 2/188, 3/188 etc.), we plot a cumulative length of contigs. Let us calculate the cumulative length and plot it.
# add a column containing the cumulative length of contigs normalized to the total length of the assembly
# hint: use the function cumsum()
basmati_ecdf_n <-
basmati_ecdf %>%
mutate(cumulative_length_contigs = cumsum(contig_length) / sum(contig_length) )
# plot the empirical cumulative length function
ggplot(data = basmati_ecdf_n,
mapping = aes(x = contig_length,
y = cumulative_length_contigs)) +
geom_point()
How long are the contigs that contain the top 50% of all bases? Add a horizontal line at y=0.5 and determine where it intersects the function.
# copy the ggplot code from the previous chunk
# and draw a horizontal line at 50%
ggplot(data = basmati_ecdf_n,
mapping = aes(x = contig_length,
y = cumulative_length_contigs)) +
geom_point() +
geom_hline(yintercept = 0.5)
Determine the exact length of the first contig above the 50% line by looking at the table.
basmati_ecdf_n
## contig_name contig_length cumulative_fraction_contigs
## 1 contig_81 9 0.005319149
## 2 contig_24 7433 0.010638298
## 3 contig_153 7528 0.015957447
## 4 contig_42 8105 0.021276596
## 5 contig_4 8941 0.026595745
## 6 contig_101 9480 0.031914894
## 7 contig_25 9952 0.037234043
## 8 contig_20 11700 0.042553191
## 9 contig_62 12032 0.047872340
## 10 contig_183 14212 0.053191489
## 11 contig_181 15949 0.058510638
## 12 contig_53 16427 0.063829787
## 13 contig_145 18745 0.069148936
## 14 contig_36 24859 0.074468085
## 15 contig_47 29542 0.079787234
## 16 contig_27 31210 0.085106383
## 17 contig_39 32749 0.090425532
## 18 contig_17 33362 0.095744681
## 19 contig_35 33483 0.101063830
## 20 contig_169 37647 0.106382979
## 21 contig_41 40016 0.111702128
## 22 contig_49 41336 0.117021277
## 23 contig_16 41376 0.122340426
## 24 contig_51 44177 0.127659574
## 25 contig_40 44786 0.132978723
## 26 contig_54 44788 0.138297872
## 27 contig_13 45535 0.143617021
## 28 contig_21 48692 0.148936170
## 29 contig_11 49007 0.154255319
## 30 contig_43 52809 0.159574468
## 31 contig_22 55304 0.164893617
## 32 contig_33 56366 0.170212766
## 33 contig_23 61183 0.175531915
## 34 contig_8 63575 0.180851064
## 35 contig_18 64234 0.186170213
## 36 contig_29 67282 0.191489362
## 37 contig_3 67462 0.196808511
## 38 contig_156 69065 0.202127660
## 39 contig_28 71871 0.207446809
## 40 contig_26 72890 0.212765957
## 41 contig_50 75157 0.218085106
## 42 contig_186 76681 0.223404255
## 43 contig_38 77801 0.228723404
## 44 contig_10 81664 0.234042553
## 45 contig_187 83191 0.239361702
## 46 contig_46 87276 0.244680851
## 47 contig_9 91244 0.250000000
## 48 contig_12 100238 0.255319149
## 49 contig_175 100570 0.260638298
## 50 contig_173 102007 0.265957447
## 51 contig_30 104186 0.271276596
## 52 contig_180 107649 0.276595745
## 53 contig_32 108243 0.281914894
## 54 contig_48 109033 0.287234043
## 55 contig_149 110428 0.292553191
## 56 contig_37 112909 0.297872340
## 57 contig_34 113587 0.303191489
## 58 contig_160 113942 0.308510638
## 59 contig_45 119599 0.313829787
## 60 contig_143 120182 0.319148936
## 61 contig_178 143039 0.324468085
## 62 contig_97 145297 0.329787234
## 63 contig_7 146029 0.335106383
## 64 contig_72 148012 0.340425532
## 65 contig_19 149136 0.345744681
## 66 contig_182 154743 0.351063830
## 67 contig_171 155222 0.356382979
## 68 contig_188 166611 0.361702128
## 69 contig_52 172586 0.367021277
## 70 contig_177 178566 0.372340426
## 71 contig_172 187082 0.377659574
## 72 contig_6 192537 0.382978723
## 73 contig_5 199118 0.388297872
## 74 contig_179 213575 0.393617021
## 75 contig_184 222827 0.398936170
## 76 contig_146 223805 0.404255319
## 77 contig_127 242673 0.409574468
## 78 contig_168 245319 0.414893617
## 79 contig_161 252103 0.420212766
## 80 contig_107 266072 0.425531915
## 81 contig_158 271912 0.430851064
## 82 contig_121 274679 0.436170213
## 83 contig_155 278187 0.441489362
## 84 contig_123 284254 0.446808511
## 85 contig_151 322212 0.452127660
## 86 contig_170 330676 0.457446809
## 87 contig_185 348122 0.462765957
## 88 contig_162 361942 0.468085106
## 89 contig_68 380643 0.473404255
## 90 contig_148 395160 0.478723404
## 91 contig_159 421338 0.484042553
## 92 contig_150 430447 0.489361702
## 93 contig_174 441564 0.494680851
## 94 contig_1 498296 0.500000000
## 95 contig_129 539136 0.505319149
## 96 contig_142 539438 0.510638298
## 97 contig_76 555924 0.515957447
## 98 contig_110 596148 0.521276596
## 99 contig_163 600324 0.526595745
## 100 contig_109 614511 0.531914894
## 101 contig_84 624327 0.537234043
## 102 contig_133 627997 0.542553191
## 103 contig_114 647674 0.547872340
## 104 contig_92 653668 0.553191489
## 105 contig_147 668131 0.558510638
## 106 contig_157 718679 0.563829787
## 107 contig_82 720367 0.569148936
## 108 contig_58 757613 0.574468085
## 109 contig_80 823800 0.579787234
## 110 contig_57 866722 0.585106383
## 111 contig_86 890409 0.590425532
## 112 contig_132 918977 0.595744681
## 113 contig_134 919952 0.601063830
## 114 contig_140 1008158 0.606382979
## 115 contig_78 1033031 0.611702128
## 116 contig_73 1044897 0.617021277
## 117 contig_66 1137797 0.622340426
## 118 contig_116 1191750 0.627659574
## 119 contig_126 1229814 0.632978723
## 120 contig_167 1231404 0.638297872
## 121 contig_137 1311625 0.643617021
## 122 contig_96 1312965 0.648936170
## 123 contig_115 1356278 0.654255319
## 124 contig_95 1466788 0.659574468
## 125 contig_112 1532608 0.664893617
## 126 contig_100 1548493 0.670212766
## 127 contig_164 1585953 0.675531915
## 128 contig_108 1589164 0.680851064
## 129 contig_89 1767502 0.686170213
## 130 contig_79 1968141 0.691489362
## 131 contig_64 1991304 0.696808511
## 132 contig_144 2068535 0.702127660
## 133 contig_119 2114382 0.707446809
## 134 contig_15 2192592 0.712765957
## 135 contig_103 2205000 0.718085106
## 136 contig_75 2214126 0.723404255
## 137 contig_94 2597398 0.728723404
## 138 contig_166 2650669 0.734042553
## 139 contig_99 2734327 0.739361702
## 140 contig_117 2742382 0.744680851
## 141 contig_111 3007642 0.750000000
## 142 contig_105 3073012 0.755319149
## 143 contig_56 3199752 0.760638298
## 144 contig_87 3410783 0.765957447
## 145 contig_63 3414174 0.771276596
## 146 contig_93 3482043 0.776595745
## 147 contig_102 3559400 0.781914894
## 148 contig_136 3648715 0.787234043
## 149 contig_113 3690720 0.792553191
## 150 contig_91 3772550 0.797872340
## 151 contig_139 3796564 0.803191489
## 152 contig_165 3839796 0.808510638
## 153 contig_154 3905687 0.813829787
## 154 contig_125 3909978 0.819148936
## 155 contig_104 3928836 0.824468085
## 156 contig_88 4178305 0.829787234
## 157 contig_120 4327600 0.835106383
## 158 contig_135 4340191 0.840425532
## 159 contig_44 4405158 0.845744681
## 160 contig_131 4605396 0.851063830
## 161 contig_122 4871081 0.856382979
## 162 contig_118 5145525 0.861702128
## 163 contig_141 5267630 0.867021277
## 164 contig_106 5343470 0.872340426
## 165 contig_60 5367701 0.877659574
## 166 contig_65 5479328 0.882978723
## 167 contig_138 5718025 0.888297872
## 168 contig_176 6112879 0.893617021
## 169 contig_67 6316586 0.898936170
## 170 contig_130 6344805 0.904255319
## 171 contig_77 6483238 0.909574468
## 172 contig_128 6953941 0.914893617
## 173 contig_85 7478188 0.920212766
## 174 contig_69 7969657 0.925531915
## 175 contig_70 8515110 0.930851064
## 176 contig_152 8752696 0.936170213
## 177 contig_14 8941791 0.941489362
## 178 contig_71 8990282 0.946808511
## 179 contig_59 9018131 0.952127660
## 180 contig_55 9647745 0.957446809
## 181 contig_2 9857291 0.962765957
## 182 contig_124 10203661 0.968085106
## 183 contig_61 10722564 0.973404255
## 184 contig_31 11117648 0.978723404
## 185 contig_83 12223136 0.984042553
## 186 contig_98 12370970 0.989361702
## 187 contig_74 16390624 0.994680851
## 188 contig_90 17040366 1.000000000
## cumulative_length_contigs
## 1 2.328254e-08
## 2 1.925207e-05
## 3 3.872663e-05
## 4 5.969385e-05
## 5 8.282376e-05
## 6 1.073480e-04
## 7 1.330934e-04
## 8 1.633607e-04
## 9 1.944868e-04
## 10 2.312525e-04
## 11 2.725118e-04
## 12 3.150076e-04
## 13 3.635000e-04
## 14 4.278089e-04
## 15 5.042326e-04
## 16 5.849713e-04
## 17 6.696913e-04
## 18 7.559971e-04
## 19 8.426159e-04
## 20 9.400067e-04
## 21 1.043526e-03
## 22 1.150460e-03
## 23 1.257498e-03
## 24 1.371781e-03
## 25 1.487641e-03
## 26 1.603505e-03
## 27 1.721302e-03
## 28 1.847265e-03
## 29 1.974044e-03
## 30 2.110658e-03
## 31 2.253727e-03
## 32 2.399543e-03
## 33 2.557820e-03
## 34 2.722285e-03
## 35 2.888455e-03
## 36 3.062510e-03
## 37 3.237031e-03
## 38 3.415699e-03
## 39 3.601625e-03
## 40 3.790188e-03
## 41 3.984616e-03
## 42 4.182985e-03
## 43 4.384253e-03
## 44 4.595513e-03
## 45 4.810724e-03
## 46 5.036503e-03
## 47 5.272546e-03
## 48 5.531857e-03
## 49 5.792026e-03
## 50 6.055913e-03
## 51 6.325437e-03
## 52 6.603920e-03
## 53 6.883939e-03
## 54 7.166001e-03
## 55 7.451673e-03
## 56 7.743763e-03
## 57 8.037607e-03
## 58 8.332369e-03
## 59 8.641765e-03
## 60 8.952670e-03
## 61 9.322705e-03
## 62 9.698581e-03
## 63 1.007635e-02
## 64 1.045925e-02
## 65 1.084506e-02
## 66 1.124537e-02
## 67 1.164692e-02
## 68 1.207793e-02
## 69 1.252441e-02
## 70 1.298635e-02
## 71 1.347032e-02
## 72 1.396840e-02
## 73 1.448351e-02
## 74 1.503602e-02
## 75 1.561246e-02
## 76 1.619143e-02
## 77 1.681921e-02
## 78 1.745384e-02
## 79 1.810602e-02
## 80 1.879433e-02
## 81 1.949776e-02
## 82 2.020834e-02
## 83 2.092799e-02
## 84 2.166334e-02
## 85 2.249689e-02
## 86 2.335233e-02
## 87 2.425291e-02
## 88 2.518923e-02
## 89 2.617394e-02
## 90 2.719619e-02
## 91 2.828617e-02
## 92 2.939972e-02
## 93 3.054202e-02
## 94 3.183109e-02
## 95 3.322581e-02
## 96 3.462130e-02
## 97 3.605945e-02
## 98 3.760166e-02
## 99 3.915466e-02
## 100 4.074437e-02
## 101 4.235947e-02
## 102 4.398407e-02
## 103 4.565957e-02
## 104 4.735058e-02
## 105 4.907900e-02
## 106 5.093818e-02
## 107 5.280174e-02
## 108 5.476164e-02
## 109 5.689277e-02
## 110 5.913494e-02
## 111 6.143838e-02
## 112 6.381573e-02
## 113 6.619559e-02
## 114 6.880365e-02
## 115 7.147605e-02
## 116 7.417914e-02
## 117 7.712256e-02
## 118 8.020556e-02
## 119 8.338703e-02
## 120 8.657261e-02
## 121 8.996571e-02
## 122 9.336229e-02
## 123 9.687091e-02
## 124 1.006654e-01
## 125 1.046302e-01
## 126 1.086361e-01
## 127 1.127388e-01
## 128 1.168499e-01
## 129 1.214224e-01
## 130 1.265138e-01
## 131 1.316653e-01
## 132 1.370164e-01
## 133 1.424862e-01
## 134 1.481584e-01
## 135 1.538626e-01
## 136 1.595904e-01
## 137 1.663098e-01
## 138 1.731669e-01
## 139 1.802405e-01
## 140 1.873349e-01
## 141 1.951155e-01
## 142 2.030652e-01
## 143 2.113428e-01
## 144 2.201663e-01
## 145 2.289986e-01
## 146 2.380065e-01
## 147 2.472145e-01
## 148 2.566535e-01
## 149 2.662012e-01
## 150 2.759606e-01
## 151 2.857821e-01
## 152 2.957155e-01
## 153 3.058193e-01
## 154 3.159342e-01
## 155 3.260979e-01
## 156 3.369070e-01
## 157 3.481023e-01
## 158 3.593301e-01
## 159 3.707260e-01
## 160 3.826400e-01
## 161 3.952412e-01
## 162 4.085524e-01
## 163 4.221795e-01
## 164 4.360028e-01
## 165 4.498887e-01
## 166 4.640635e-01
## 167 4.788557e-01
## 168 4.946694e-01
## 169 5.110101e-01
## 170 5.274238e-01
## 171 5.441956e-01
## 172 5.621851e-01
## 173 5.815308e-01
## 174 6.021479e-01
## 175 6.241761e-01
## 176 6.468188e-01
## 177 6.699508e-01
## 178 6.932082e-01
## 179 7.165376e-01
## 180 7.414959e-01
## 181 7.669962e-01
## 182 7.933925e-01
## 183 8.211312e-01
## 184 8.498920e-01
## 185 8.815127e-01
## 186 9.135157e-01
## 187 9.559174e-01
## 188 1.000000e+00
# 6316586, or 6.32 Mb
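Reading the value off a long table is error-prone, so the lookup can be wrapped in a small function. The helper `n_stat()` below is a hypothetical name, not part of any package, and is only a sketch: it sorts contigs from longest to shortest and returns the length at which the running total first reaches the requested fraction of the assembly. This convention may differ at exact ties from other tools.

```r
# hypothetical helper: N statistic (N50 by default) of a vector of contig lengths
n_stat <- function(contig_lengths, fraction = 0.5) {
  sorted_desc <- sort(contig_lengths, decreasing = TRUE)
  running_total <- cumsum(sorted_desc)
  # index of the first contig at which the running total reaches the target
  first_hit <- which(running_total >= fraction * sum(sorted_desc))[1]
  sorted_desc[first_hit]
}

# toy contigs with a total length of 290 (hypothetical data)
toy_lengths <- c(80, 70, 50, 40, 30, 20)

n_stat(toy_lengths)                  # N50
n_stat(toy_lengths, fraction = 0.9)  # N90
```

Applied to the real data as n_stat(basmati$contig_length), this should return the same contig length that was read from the table above.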
Is this the same value as the N50 reported in the abstract of the paper?
Why do genome biologists prefer N statistics to eCDF metrics, such as median?