f_group_by {fastplyr} | R Documentation |
'collapse' version of dplyr::group_by()
Description
This works the same way as dplyr::group_by() and typically runs at
about the same speed while using slightly less memory.
Usage
f_group_by(
  data,
  ...,
  .add = FALSE,
  .order = df_group_by_order_default(data),
  .by = NULL,
  .cols = NULL,
  .drop = df_group_by_drop_default(data)
)
group_ordered(data)
f_ungroup(data)
Arguments
data |
data frame. |
... |
Variables to group by. |
.add |
Should groups be added to existing groups?
Default is FALSE. |
.order |
Should groups be ordered? If TRUE, the group data are sorted; if FALSE, groups are returned in order of first appearance. |
.by |
(Optional). A selection of columns to group by for this operation.
Columns are specified using |
.cols |
(Optional) alternative to |
.drop |
Should unused factor levels be dropped? Default is |
Details
f_group_by() works almost exactly like the 'dplyr' equivalent.
An attribute "ordered" (TRUE or FALSE) is added to the group data to
signify whether the groups are sorted.
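As a minimal sketch of the calls described above, the following groups a small data frame, inspects the "ordered" attribute, and removes the grouping again. The data frame `df` and its columns are made up for illustration, and the snippet assumes fastplyr is installed.

```r
library(fastplyr)

# Hypothetical example data: one grouping column and one value column
df <- data.frame(g = c("b", "a", "b", "a"), x = 1:4)

# Group by g; the result is a grouped_df
grouped <- f_group_by(df, g)
inherits(grouped, "grouped_df")

# The "ordered" flag is stored on the group data
attr(attr(grouped, "groups"), "ordered")

# group_ordered() reports the same flag
group_ordered(grouped)

# f_ungroup() removes the grouping
ungrouped <- f_ungroup(grouped)
inherits(ungrouped, "grouped_df")
```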
Ordered vs Sorted
The distinction between ordered and sorted is somewhat subtle.
Functions in fastplyr that use a sort argument generally refer
to the top-level dataset being sorted in some way, either by sorting
the group columns, as in f_expand() or f_distinct(), or
some other columns, like the count column in f_count().
The .order argument, when set to TRUE (the default),
means that the group data will be calculated
using a sort-based algorithm, leading to sorted group data.
When .order is FALSE, the group data will be returned based on
the order of first appearance of the groups in the data.
This order of first appearance may still naturally be sorted,
depending on the data.
For example, group_id(1:3, order = TRUE) results in the same group IDs
as group_id(1:3, order = FALSE) because 1, 2, and 3 appear in the data in
ascending sequence, whereas group_id(3:1, order = TRUE) does not equal
group_id(3:1, order = FALSE).
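The difference between the two numbering schemes can be illustrated in base R. This is only a sketch of the idea and not fastplyr's implementation: the helper names `sorted_ids()` and `first_ids()` are hypothetical stand-ins for group IDs numbered by sorted value and by first appearance, respectively.

```r
# Base R stand-ins for the two numbering schemes (not fastplyr internals)
sorted_ids <- function(x) match(x, sort(unique(x)))  # IDs follow sorted values
first_ids  <- function(x) match(x, unique(x))        # IDs follow first appearance

asc  <- 1:3
desc <- 3:1

# Already ascending data: both schemes agree
identical(sorted_ids(asc), first_ids(asc))    # TRUE

# Descending data: the schemes differ
sorted_ids(desc)                              # 3 2 1
first_ids(desc)                               # 1 2 3
identical(sorted_ids(desc), first_ids(desc))  # FALSE
```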
Part of the reason for the distinction is that internally fastplyr
can in theory calculate group data
using the sort-based algorithm and still return unsorted groups,
though this combination is only available to the user in limited places like
f_distinct(.order = TRUE, .sort = FALSE).
The other reason is to prevent confusion in the meaning
of sort and order, so that order always refers to the
algorithm specified, resulting in sorted groups, and sort implies a
physical sorting of the returned data. It's also worth mentioning that
in most functions, sort will implicitly utilise the sort-based algorithm
specified via order = TRUE.
Using the order-of-first appearance algorithm for speed
In many situations (not all) it can be faster to use the
order-of-first appearance algorithm, specified via .order = FALSE.
This can generally be accessed by first calling
f_group_by(data, ..., .order = FALSE) and then
performing your calculations.
To utilise this algorithm more globally and package-wide,
set the '.fastplyr.order.groups' option to FALSE
using the code:
options(.fastplyr.order.groups = FALSE).
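A short sketch of both routes follows, assuming fastplyr is installed. The data frame `df` is hypothetical. The option is set temporarily and then restored so the example does not change session-wide behaviour.

```r
library(fastplyr)

# Hypothetical example data where "b" appears before "a"
df <- data.frame(g = c("b", "a", "b", "a"), x = 1:4)

# Route 1: request the order-of-first appearance algorithm for one call
unordered <- f_group_by(df, g, .order = FALSE)
group_ordered(unordered)

# The group keys follow first appearance in the data
attr(unordered, "groups")$g

# Route 2: set the package-wide option, saving the previous value
old <- options(.fastplyr.order.groups = FALSE)
via_option <- f_group_by(df, g)
group_ordered(via_option)

# Restore the previous option value
options(old)
```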
Value
f_group_by() returns a grouped_df that can be used
for further grouped calculations.
group_ordered() returns TRUE if the group data are sorted,
i.e. if attr(attr(data, "groups"), "ordered") == TRUE. If sorted,
which is usually the default, summary calculations
like f_summarise() or dplyr::summarise() produce sorted groups.
If FALSE, groups are returned based on order of first appearance in the data.
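The effect on summary output can be sketched as follows, assuming fastplyr is installed. The data frame `df` and the summary column `total` are hypothetical names chosen for illustration.

```r
library(fastplyr)

# Hypothetical example data where "b" appears before "a"
df <- data.frame(g = c("b", "a", "b", "a"), x = 1:4)

# Sorted group data: summary rows come back in sorted group order
sorted_res <- f_summarise(f_group_by(df, g, .order = TRUE), total = sum(x))
sorted_res

# Order-of-first appearance: summary rows follow the data
first_res <- f_summarise(f_group_by(df, g, .order = FALSE), total = sum(x))
first_res
```

Both results contain the same per-group totals; only the row order differs.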