HBmethod {univOutl} | R Documentation |
This function implements the method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data, i.e. when the same variable is measured at two time points.
HBmethod(yt1, yt2, U=0.5, A=0.05, C=4, pct=0.25,
id=NULL, std.score=FALSE, return.dataframe=FALSE, adjboxE=FALSE)
yt1 |
Numeric vector providing the values observed at time t1. |
yt2 |
Numeric vector providing the values observed at time t2 (t2 > t1). |
U |
Numeric, parameter needed to determine the ‘importance’ of a ratio. The value should lie in [0, 1] interval; commonly used values are 0.3, 0.4, or 0.5 (default) (see Details for further information). |
A |
Numeric, parameter needed when computing the scale measure used to derive the bounds. Hidiroglou and Berthelot (1986) suggest setting |
C |
Numeric, parameter determining the extension of the interval; greater values will provide larger intervals, i.e. fewer expected outliers. Values commonly used are 4 (default) or 7, but also values close or grater than 40 can be used in some particular cases. Note that two C values can be provided instead of one, the first one will be used to determine the left tail bound, while the second determines the right tail bound; this setting can help in improving outlier detection in skewed distributions (see Details for further information). |
pct |
Numeric, the percentage point of the scores that will be used to calculate the lower and upper bounds. By default, |
id |
Optional numeric or character vector, with identifiers of units. If |
std.score |
Logical, if |
return.dataframe |
Logical, if |
adjboxE |
Logical (default |
The method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data consists in deriving a score variable based on the ratios r_i = y_{i,t2}/y_{i,t1}
(yt2/yt1
) with i=1,2,\ldots, n
being n
the number of observations after discarding NAs and 0s in both yt1
and yt2
.
At first the ratios are centered around their median r_M
:
s_i = 1 - r_M/r_i \qquad \mbox{if} \qquad 0 < r_i < r_M
s_i = r_i/r_M - 1 \qquad \mbox{if} \qquad r_i \geq r_M
Then, in order to account for the magnitude of data, the following score is derived:
E_i = s_i \times [ max(y_{i,t1},y_{i,t2} ) ]^U
Finally, the interval is calculated as:
(E_M-C \times d_{Q1}, E_M+C\times d_{Q3} )
where
d_{Q1} = max( E_M - E_{Q1}, |A \times E_M| )
and d_{Q3} = max( E_{Q3} - E_M, |A \times E_M| )
being E_{Q1}
, E_M
and E_{Q3}
the quartiles of the E scores (when pct = 0.25
, default)).
Recently Hidiroglou and Emond (2018) suggest using percentiles P10 and P90 of the E scores in replacement of respectively Q1 and Q3 to avoid the drawback of many units identified as outliers; this is likely to occur when a large proportion of units (>1/4) has the same ratio. P10 and P90 are achieved by setting pct = 0.10
when running the function.
In practice, all the units with an E score outside the interval are considered as outliers. Notice that when two C
values are provided, then the first is used to derive the left bound while the second determines the right bound.
When std.score=TRUE
a standardized score is derived in the following manner:
z_{E,i} = g \times \frac{E_i - E_M}{d_{Q1}} \qquad \mbox{if} \qquad E_i < E_M
z_{E,i} = g \times \frac{E_i - E_M}{d_{Q3}} \qquad \mbox{if} \qquad E_i \geq E_M
The constant g is set equal to qnorm(1-pct)
and makes d_{Q1}
and d_{Q3}
approximately unbiased estimators when the E scores follow the normal distribution.
When adjboxE = TRUE
outliers on the E scores will all be searched using the boxplot adjusted for skewness as implemented in the function boxB
when run with with argument method = "adjbox"
.
A list whose components depend on the return.dataframe
argument. When return.dataframe=FALSE
just the following components are provided:
median.r |
the median of the ratios |
quartiles.E |
Quartiles of the E score |
bounds.E |
Bounds of the interval of the E score, values outside are considered outliers. |
excluded |
The identifiers or positions (when |
outliers |
The identifiers or positions (when |
outliersBB |
The identifiers or positions (when |
When return.dataframe=TRUE
, the first three components remain the same with, in addition, two dataframes:
excluded |
A dataframe with the subset of observations excluded. The data frame has the following columns: 'id' (units' identifiers), 'yt1' columns 'yt2' |
data |
A dataframe with the the not excluded observations and the following columns: ‘id’ (units' identifiers), ‘yt1’, ‘yt2’, ‘ratio’ (= yt1/yt2), ‘sizeU’ (=max(yt1, yt2)^U),‘Escore’ (the E scores, see Details), ‘std.Escore’ (the standardized E scores when |
Marcello D'Orazio mdo.statmatch@gmail.com
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
Hidiroglou, M.A. and Emond, N. (2018) ‘Modifying the Hidiroglou-Berthelot (HB) method’. Unpublished note, Business Survey Methods Division, Statistics Canada, May 18 2018.
set.seed(222)
x0 <- rnorm(30, 50, 5)
x0[1] <- NA
set.seed(333)
rr <- runif(30, 0.9, 1.2)
rr[10] <- 2
x1 <- x0 * rr
x1[20] <- 0
out <- HBmethod(yt1 = x0, yt2 = x1)
out$excluded
out$median.r
out$bounds.E
out$outliers
cbind(x0[out$outliers], x1[out$outliers])
out <- HBmethod(yt1 = x0, yt2 = x1,
return.dataframe = TRUE)
out$excluded
head(out$data)