datastructures
implements various data structures that are frequently used in computer science. Among these are for example Fibonacci heaps, stacks, queues or hashmaps. Advanced data structures are essential in many computer science and statistics problems, for example graph algorithms or string analysis. The package uses Boost
and STL
data types and extends these to R
with Rcpp
modules.
Fibonacci and binomial heaps are priority queue data structures using the minimum heap property. They can be represented using collections of trees or linked lists consisting of nodes of elements. Every node is a pair of keys and values, where the key decides the priority of the node in the heap. Fibonacci heaps have various applications, one of the most famous being in efficiently finding shortest paths using Dijkstra’s algorithm. Fibonacci heaps can add values in amortized \(\mathcal{O}(1)\) time and remove the minimum value in \(\mathcal{O}(log \, n)\) time making them a good choice in many real world scenarios. Binomial heaps have in general slightly worse asymptotic bounds. However, removing the minimum element (pop
) has an asymptotically tight bound of \(\Theta(log \, n)\). The two classes have the exact same methods, so every method explained here is available for the other class, too.
You can create a heap like this:
fheap <- fibonacci_heap("numeric") bheap <- binomial_heap("numeric")
This gives us heaps with numeric
keys. You can pick from either integer
, numeric
or character
keys.
The values an be arbitray, meaning, that you can add data.frames
, vectors
or any object you like. So if we insert several key-value pairs, the pair with the minimum key would have the highest priority. Let’s insert some values and have a look:
keys <- sample(seq(0, 1, by=.2)) values <- paste0("V", keys) fheap <- insert(fheap, keys, values) size(fheap)
## [1] 6
There are also some other ways to insert into a heap. If you want to insert a vector
ial value, do this:
fheap <- insert(fheap, -1, letters[1:5]) peek(fheap)
## $`-1`
## [1] "a" "b" "c" "d" "e"
As described above, you can basically add any tope to your heap. Here I add two ke-value pairs. The keys are provided as vectors, while the values are a list of different objects.
fheap <- insert(fheap, -2, list(a=5)) fheap <- insert(fheap, -3, data.frame(A=rnorm(5), B=1:5)) peek(fheap)
## $`-3`
## A B
## 1 -1.7522420 1
## 2 -0.4448897 2
## 3 -1.5316760 3
## 4 -0.2565086 4
## 5 -0.7080480 5
NOTE: for efficienty it is recommended to insert all values at the same time, if you can do this. This should be way faster. If we insert multiple objects, we need to wrap it around a list.
Since Fibonacci and binomial heaps use the minimum heap property the first element in the heap should be -6 -> "hello
, because -6 has the lowest value.
peek(fheap)
## $`-6`
## [1] "hallo"
size(fheap)
## [1] 12
That worked nicely. peek
gives us the first element from the heap without removing it. If we want to have it removed we call the pop
function. Of course this changes the priority of the heap:
pop(fheap)
## $`-6`
## [1] "hallo"
size(fheap)
## [1] 11
You can alternatively insert values like this:
fheap[-10] <- "V-1" peek(fheap)
## $`-10`
## [1] "V-1"
This works for all the described insert
methods above.
Sometimes, we want to decrease a key and rearrange its position in the heap. If we have a unique key this can be easily done:
decrease_key(fheap, from=-10, to=-11)
## An object of class fibonacci_heap<numeric, SEXP>
##
## Peek: -11 -> character, ...
peek(fheap)
## $`-11`
## [1] "V-1"
The function takes vectors, so you can also pass vectorial to
and from
arguments:
## An object of class fibonacci_heap<numeric, SEXP>
##
## Peek: -15 -> list, ...
peek(fheap)
## $`-15`
## $`-15`$a
## [1] 2
However, things get more complicated if we have multiple identical keys and want to decrease a specific one. In this case, in order to decrease to correct node, we need to get its handle
first:
handle(fheap, -15)
## [[1]]
## [[1]]$handle
## [1] "handle-9"
##
## [[1]]$value
## [[1]]$value$a
## [1] 2
The handle
method returns the id
of the node and the value
that is stored for a given key
(in this case -15). Thus if we want to decrease a node with ambiguous key, we also need to specify the handle (e.g. the id
):
hand <- handle(fheap, -15) decrease_key(fheap, -15, -20, hand[[1]]$handle)
## An object of class fibonacci_heap<numeric, SEXP>
##
## Peek: -20 -> list, ...
peek(fheap)
## $`-20`
## $`-20`$a
## [1] 2
The handle
method always returns a list where every element represents a node in the heap. The handle
function is vectorized, so you can supply are vector of keys and will receive the handles that fit thse:
handle(fheap, c(-3))
## [[1]]
## [[1]]$handle
## [1] "handle-8"
##
## [[1]]$value
## A B
## 1 -1.7522420 1
## 2 -0.4448897 2
## 3 -1.5316760 3
## 4 -0.2565086 4
## 5 -0.7080480 5
Sometimes you don’t know the key
or handle
, but only the value
of a node, and you want to do a decrease-key
operation. Suppose you want to decrease the key of a node with value V-1
, then you would to need to call handle
with a value argument and a missing key
:
hands <- handle(fheap, value="V1") decrease_key(fheap, from=hands[[1]]$key, to=-1000, hands[[1]]$handle)
## An object of class fibonacci_heap<numeric, SEXP>
##
## Peek: -1000 -> character, ...
peek(fheap)
## $`-1000`
## [1] "V1"
NOTE: Getting the handles from a value is a costly operation, because we need to compare all entries of the heap with the supplied argument value. If your supplied argument is, for instance, a large dataframe this will quickly become slow. So it is recommended to use this function cautiously.
The handle function for values can take multiple elements, too. However, here we need to supply a list
and not a vector (for obvious reasons).
## [[1]]
## [[1]]$handle
## [1] "handle-4"
##
## [[1]]$key
## [1] -1000
##
##
## [[2]]
## [[2]]$handle
## [1] "handle-9"
##
## [[2]]$key
## [1] -20
Finally, you can get all data from a heap, by calling the value
function:
val <- values(fheap) val[1:2]
## $`-1000`
## $`-1000`$handle
## [1] "handle-4"
##
## $`-1000`$key
## [1] -1000
##
## $`-1000`$value
## [1] "V1"
##
##
## $`-20`
## $`-20`$handle
## [1] "handle-9"
##
## $`-20`$key
## [1] -20
##
## $`-20`$value
## $`-20`$value$a
## [1] 2
In all of these scenarios neither the key nor the value need to be unique, because the handle (node id) always is.
You can remove all entries from a heap by calling:
fheap <- clear(fheap) fheap
## An object of class fibonacci_heap<numeric, SEXP>
##
## Peek: NULL -> NULL, ...
As a last example, consider this use case whre we use a character
heap.
library('purrr') bheap <- binomial_heap("character") bheap <- insert(bheap, letters[c(2:6, 5, 5)], c(2:6, 5L, 7L)) bheap <- insert(bheap, "x", data.frame(A="hi")) bheap <- insert(bheap, "x", list(a=2)) peek(bheap)
## $b
## [1] 2
vector.keys <- handle(bheap, "x") %>% purrr::map_chr(.f = function(x) x$handle) vector.keys
## [1] "handle-7" "handle-8"
decrease_key(bheap, from="x", to="b", handle=vector.keys[1])
## An object of class binomial_heap<character, SEXP>
##
## Peek: b -> data.frame, ...
peek(bheap)
## $b
## A
## 1 hi
hand <- handle(bheap, key = "b") hand
## [[1]]
## [[1]]$handle
## [1] "handle-0"
##
## [[1]]$value
## [1] 2
##
##
## [[2]]
## [[2]]$handle
## [1] "handle-7"
##
## [[2]]$value
## A
## 1 hi
decrease_key(bheap, from="b", to="a", handle=hand[[2]]$handle)
## An object of class binomial_heap<character, SEXP>
##
## Peek: a -> data.frame, ...
pop(bheap)
## $a
## A
## 1 hi
pop(bheap)
## $b
## [1] 2
Maps use hash functions to store key-value pairs. Given a key, the hash function computes a bucket where the value is stored, making lookup, insertion and deletion on average possible in \(\mathcal{O}(1)\) time. In the worst case these runtimes decrease to \(\mathcal{O}(n)\) time, where \(n\) is the number of entries in a bucket. The difference between the three classes is that hashmaps and multimaps compute a mapping from keys to values (the hash function) \[ \, f: \mathcal{K} \rightarrow \mathcal{V} \, \] in ‘one direction’, while bimaps also compute a mapping \[ \, f: \mathcal{V} \rightarrow \mathcal{K} \, \] in the ‘other direction’. So a bimap is essentially a combination of two hashmaps. The difference between a hashmap and a multimap is that for hashs only one unique key can be stored for a value, while multimaps allow storing the same key several times fur multiple values,
Hashmaps and multimaps define mappings in a single direction, for that reason we can use arbitrary R objects as values. For bimaps we can only use primitive types, otherwise it will be difficult to access certain key-value pairs.
Let’s quickly have a look at some examples. We define character
hashmap and mulimaps and character/integer
bimap.
hash <- hashmap("character") mm <- multimap("character") bimap <- bimap("character", "integer")
Insertion to maps works just like before, by calling the insert
function or using a subset operator. As explained for hashmap and multimap we can use all kinds of values
, for bimaps we need to define the a primitive type as well.
keys <- paste0("V", 1:5) values <- 1:5 hash <- insert(hash, keys[1:4], values[1:4]) mm <- insert(mm, keys[c(2, 2)], list(list(a=1), data.frame(a=rnorm(5), 2, 3))) bimap <- insert(bimap, keys[1:4], values[1:4]) hash[keys[5]] <- values[5] bimap[keys[5]] <- values[5] mm[keys[5]] <- diag(5)
Values (and keys in the case of bimaps) can then be accessed like this:
at(hash, keys[1])
## [[1]]
## [1] 1
hash[keys[1]]
## [[1]]
## [1] 1
at(bimap, keys[1], "values")
## [1] 1
at(bimap, values[2], "keys")
## [1] "V2"
at(bimap, keys[1])
## [1] 1
at(mm, keys[2])
## [[1]]
## a X2 X3
## 1 0.1351150 2 3
## 2 0.4606338 2 3
## 3 -0.4498902 2 3
## 4 0.1067527 2 3
## 5 -0.1667827 2 3
##
## [[2]]
## [[2]]$a
## [1] 1
mm[keys[5]]
## [[1]]
## [1] 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
For hashmaps and multimaps the subset-operator [
is also supported for accessing elements. This does not make so much sense for bimaps, because here we have to choose whether we want to access keys or values. In the example on the top we showed how both is possible by either providing keys
or values
as third argument to the get
function. If you leave it out, then the third argument defaults to values
.
If you want to directly access keys
or values
as vectors or have a look at a few random elements, you would call:
keys(bimap)
## [1] "V1" "V2" "V3" "V4" "V5"
values(hash)
## [[1]]
## [1] 5
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 1
peek(bimap)
## $V1
## [1] 1
##
## $V2
## [1] 2
##
## $V3
## [1] 3
##
## $V4
## [1] 4
##
## $V5
## [1] 5
Removing entries in all of the three map classes works like this:
hash <- erase(hash, keys[1]) hash
## An object of class hashmap<character,T>
##
## V2 -> integer
## V3 -> integer
## V4 -> integer
## V5 -> integer
mm <- erase(mm, keys[2])
## Note: method with signature 'map#vector#missing' chosen for function 'erase',
## target signature 'multimap#character#missing'.
## "multimap#vector#ANY" would also be valid
mm
## An object of class multimap<character,T>
##
## V5 -> numeric
As you can see for multimaps, all key-value pairs get removed, when the function is called as above. If you are interested in only one specific key-value pair, you additionally need to supply the respective value:
mm <- insert(mm, keys[c(2, 2)], list(list(a=1), data.frame(a=rnorm(5), 2, 3))) mm <- erase(mm, keys[2], list(a=1)) mm[ keys[2] ]
## [[1]]
## a X2 X3
## 1 -0.43713121 2 3
## 2 0.59779861 2 3
## 3 0.07946596 2 3
## 4 -0.05497089 2 3
## 5 -1.85216964 2 3
For bimaps we can also remove from the other side, i.e. when we provide a value:
bimap <- erase(bimap, value=values[1]) bimap
## An object of class bimap<character,T>
##
## V2 <--> integer
## V3 <--> integer
## V4 <--> integer
## V5 <--> integer
For multimaps, where we allow having the same key several times, removing an entry as above removes all the key-value pairs with the
The three different map types support to easily clear the entire data structure using:
clear(bimap)
## An object of class bimap<character,T>
##
## NULL <--> NULL
size(bimap)
## [1] 0
clear(hash)
## An object of class hashmap<character,T>
##
## NULL -> NULL
size(hash)
## [1] 0
clear(mm)
## An object of class multimap<character,T>
##
## NULL -> NULL
size(mm)
## [1] 0
Queues and stackes are two list datastructures using the STL
’s std::deque
in the backend, i.e. insertion at the end and getting the first element can be done in constant time \(\mathcal{O}(1)\) . Queues and stacks are for example used in depths-first-search and breadth-first-search making them versatile datastructures. While queues use the first-in-first-out principle, stacks use the last-in-first-out principle.
The two datatypes use the exact same methods. You can instantiate a stack or a queue using:
qu <- queue() st <- stack()
As we heaps and hashmap/multimaps you acn pack anything you want into the containers. Now, let’s again insert some data. Single values can just be supplied as argument, while multiple values need to be packed into a list again.
qu <- insert(qu, 1) qu <- insert(qu, list(rnorm(5), 1, diag(3))) st <- insert(st, as.list(rnorm(5))) st <- insert(st, data.frame(a=1:3, b=3:1))
BEWARE: If you add one vector
to queues and stacks it will add one element:
peek(qu)
## [1] 1
peek(st)
## a b
## 1 1 3
## 2 2 2
## 3 3 1
If you want to remove elements, call the pop
function which will remove the head of the data structure.
pop(st)
## a b
## 1 1 3
## 2 2 2
## 3 3 1
pop(st)
## [1] -0.6635281
For a queue
pop
and peek
return the first element that got inserted, while a stack returns the last element that was inserted.
The size of a deque can be computed by calling:
size(qu)
## [1] 4
Stacks and queues can be emptied like this:
qu <- clear(qu) qu
## An object of class queue<SEXP>
##
## Peek: NULL, ...
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_CH.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_CH.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_CH.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] purrr_0.3.4 datastructures_0.2.9 Rcpp_1.0.5
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-16 rprojroot_1.3-2 digest_0.6.25 crayon_1.3.4
## [5] assertthat_0.2.1 MASS_7.3-51.6 R6_2.4.1 backports_1.1.8
## [9] magrittr_1.5 evaluate_0.14 stringi_1.4.6 rlang_0.4.7
## [13] rstudioapi_0.11 fs_1.5.0 rmarkdown_2.3 pkgdown_1.5.1
## [17] desc_1.2.0 tools_4.0.2 stringr_1.4.0 yaml_2.2.1
## [21] xfun_0.16 compiler_4.0.2 memoise_1.1.0 htmltools_0.5.0
## [25] knitr_1.29