# Estimating the Size of a Demonstration

## Abstract

Inspired by the recent **March For Science** we look
into methods for the statistical estimation of the number of people
participating in a demonstration organized as a march. In particular, we
provide R code to reproduce the **two on-the-spot counting
method** analysis of Yip et al. (2010) for the data of the
July 1 March in Hong Kong 2006.

This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License. The
markdown+Rknitr source code of this blog is available under a GNU General Public
License (GPL v3) from github.

## Introduction

Exercising your democratic right to express support for a cause by
demonstration has found new usage. The March for
Science is a recent example of such a demonstration inspired by
recent political developments. The number of persons participating in
such marches is *the* indicator by which the support of the cause
is measured. Crowd size estimates have therefore always been subject to
political interpretation and, hence, possible politically motivated
bias. In this work we focus on what statistics has to offer with respect
to finding the **true number** of participants. A good
overview of this task’s challenges is given in Watson and Yip (2011). A particular
difficulty is the size estimation of moving crowds as seen in
marches.

As a case study we replicate the analysis of Yip et al. (2010) on estimating the number of participants in the 1st of July Marches in Hong Kong. Since the handover to China in 1997 these marches have been conducted yearly to demonstrate for democracy and freedom of speech in Hong Kong. Shown below is the 3.6 km long demonstration route from Victoria Park to Government Headquarters for the 2006 demonstration as described by Yip et al. (2010). A YouTube video of the 2006 demonstration illustrates this better than words.

In order to estimate the number of participants a **two
on-the-spot counting method** was devised by Yip et al. (2010): Two
points along the march were selected as shown in the above map: Point A
denotes the location after which an individual is counted as being part
of the march. In order to take into account that people join the march
at a later point than A, a second point B is selected to adjust the
count at A for such late entries. Three to four persons were stationed
at each of the two counting locations. Once the demonstration passed the
respective point each of them started to count the number of people
passing in a one-minute interval. They counted for one minute every 5
minutes until the last person of the march had passed the counting
point.

### Loading the Data

We store the resulting counting data displayed in Table 2 and Table 3
of Yip et al. (2010) as two Excel-files. In a data pre-processing step these
are then read and combined into one data.frame containing the columns
`Y1`-`Y4`. Furthermore, we re-format the table’s
time specification to proper POSIX formatted date-times. The exact data
dancing steps can be found in the accompanying Rmd
code of this post. Altogether, this yields a `tbl` with
the first couple of lines looking as follows:

```
## # A tibble: 6 × 7
## Y1 Y2 Y3 Y4 Mean Point Time_POSIX
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dttm>
## 1 150 NA 160 180 163.3333 A 2006-07-01 15:55:00
## 2 308 360 250 280 299.5000 A 2006-07-01 16:00:00
## 3 430 350 300 270 337.5000 A 2006-07-01 16:05:00
## 4 210 280 240 252 245.5000 A 2006-07-01 16:10:00
## 5 130 216 200 180 181.5000 A 2006-07-01 16:15:00
## 6 210 260 300 280 262.5000 A 2006-07-01 16:20:00
```
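The time re-formatting step could look roughly like the sketch below. The raw time strings, the variable names and the time zone are assumptions made for illustration; the actual pre-processing lives in the Rmd code of the post.

```r
# Hypothetical raw time strings as they might appear in the spreadsheets
time_chr <- c("15:55", "16:00", "16:05")

# Prepend the date of the march and parse to POSIXct
# (the time zone is an assumption, not taken from the original code)
time_posix <- as.POSIXct(paste("2006-07-01", time_chr),
                         format = "%Y-%m-%d %H:%M", tz = "Asia/Hong_Kong")

# Spacing between consecutive counting blocks in minutes
spacing <- as.numeric(diff(time_posix), units = "mins")
spacing
```

Having proper date-times makes it easy to check that the counting blocks really are five minutes apart.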

We then compute a number of row-wise statistics for all columns
containing the counts. Which columns contain the counts is specified by
a regular expression `ccol_regexp`. In our case this would be
`"^Y[0-9]+"`.
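A minimal base R sketch of such row-wise statistics is given below. It uses the first three rows of point A from the table above; the names `is_ccol`, `Y` and `row_stats` are chosen for illustration and are not necessarily those used in the post's code.

```r
# First three rows of point A, copied from the table above
counts <- data.frame(
  Y1 = c(150, 308, 430),
  Y2 = c(NA, 360, 350),
  Y3 = c(160, 250, 300),
  Y4 = c(180, 280, 270),
  Point = "A"
)

# Select the count columns by regular expression
ccol_regexp <- "^Y[0-9]+"
is_ccol <- grepl(ccol_regexp, names(counts))
Y <- as.matrix(counts[, is_ccol, drop = FALSE])

# Row-wise statistics, ignoring missing counts
row_stats <- data.frame(
  Mean = rowMeans(Y, na.rm = TRUE),
  Var  = apply(Y, 1, var, na.rm = TRUE),
  n    = rowSums(!is.na(Y))
)
row_stats
```

The `Mean` column reproduces the values 163.33, 299.5 and 337.5 shown in the table; the first row is based on three counters only because `Y2` is missing.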

### Descriptive Statistics

The counts of the 4 counters at point A and the 3 counters at point B are summarized in the following small table:

Point | n_counters | n_timepoints | sum_of_the_mean_counts |
---|---|---|---|
A | 4 | 22 | 4849.50 |
B | 3 | 26 | 4746.67 |

A time series for the individual counts as well as their mean is shown below. One observes that at point B the intensity of the crowd was lower, as the demonstration had stretched over a larger distance. The latter is seen from the time span between the first and last count for the two points: approximately 1:45h for A vs. 2:45h for B.

## Two On-the-Spot Counting Method

Below we give the mathematical details of the two on-the-spot counting method. Consider the counting point \(X\) of the march, i.e. \(X\in \{A,B\}\). Let \(m_X\) be the number of counters at this point. Assume that the first people of the march pass \(X\) at time point \(a_X\) and that the last people of the march reach \(X\) at time point \(e_X\). The time unit could for example be minutes. Counting is done such that at regular intervals \(c\) one counts all people passing the point of observation within a time block of 1 unit, say 1 minute. Let \(k_X\) denote the number of time points where observations are available at \(X\). Hence, the \(k_X\) observations at \(X\) are available for the time points \(a_X,a_X+c,a_X+2c,\ldots,a_X+(k_X-1)c\). Denote by \(Y_{X,i}(t)\) the \(i\)’th counter’s count at time \(t\). Then

\[\overline{Y}_X(t) = \frac{1}{m_X} \sum_{i=1}^{m_X} Y_{X,i}(t)\]

is the average of the observers’ counts at point \(X\) for time \(t\). By scaling up each observer’s observations to account for the time blocks without a count and averaging over the different observers we get an estimate for the number of participants at point \(X\):

\[ \hat{N}_X = \frac{e_X}{k_X} \sum_{j=1}^{k_X} \overline{Y}_X(a_X + (j-1)c).\]

In most cases one would have that \(e_X/k_X=c\), with time measured relative to \(a_X\) so that \(e_X\) is the time it takes the march to pass \(X\). As an example: If a counter
counts 200 people for *every* 1-minute-counting-block during
two hours, i.e. corresponding to 24 observations (one every five
minutes), her estimate for \(N_X\)
would be \(200 \cdot 24 \cdot 5 = 24000\).
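The single-point estimator can be sketched in a few lines of R. The function name `N_hat_X` and its argument names are made up for illustration, and the sketch assumes \(e_X/k_X = c\), i.e. that each one-minute count stands in for a block of \(c\) minutes.

```r
# Sketch of the single-point estimator, assuming e_X / k_X = c:
# ybar    - vector of mean counts, one entry per counted one-minute block
# c_min   - spacing between counting blocks in minutes
N_hat_X <- function(ybar, c_min = 5) {
  c_min * sum(ybar)
}

# The worked example: 200 people in each of 24 counted blocks
ybar <- rep(200, 24)
N_hat_X(ybar)
# 24000
```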

In order to adjust the estimate at point \(A\) for people who joined the march after
point \(A\), we perform an independent
counting at point \(B\) and
additionally ask \(m\) people at point
\(B\) whether they marched past point
\(A\) or not. Denoting by \(\hat{\phi}\) the fraction of people
answering yes to this question, the **two on-the-spot counting
estimator** is \[
\hat{N} = \hat{N}_A + (1-\hat{\phi}) \hat{N}_B.
\] Note that this estimator does not take into account that
people could potentially leave the march between \(A\) and \(B\) and that it is also possible to join the
march after \(B\). However, the
proportion of such participants is assumed to be negligible.

A confidence interval (CI) based on an asymptotic normal assumption can be obtained by deriving that

\[ \operatorname{se}(\hat{N}) = \sqrt{\widehat{\operatorname{Var}}(\hat{N}_A) + (1-\hat{\phi})^2 \widehat{\operatorname{Var}}(\hat{N}_B) + \hat{N}_B^2 \frac{\hat{\phi}(1-\hat{\phi})}{m}}, \]

where we have used that

\[ \widehat{\operatorname{Var}}(\hat{N}_X) = \frac{e_X^2}{k_X^2} \sum_{j=1}^{k_X} \widehat{\operatorname{Var}}(\overline{Y}_X(a_X + (j-1)c)) = \frac{e_X^2}{k_X^2} \sum_{j=1}^{k_X} \frac{\widehat{\operatorname{Var}}(Y_X(a_X + (j-1)c))}{m_X} \]

and

\[ \widehat{\operatorname{Var}}(Y_X(t)) = \frac{1}{m_X-1}\sum_{i=1}^{m_X} (Y_{X,i}(t) - \overline{Y}_X(t))^2. \]
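To make the variance formulas concrete, here is a hedged base R sketch. The function names `var_N_hat_X` and `se_N_hat` are hypothetical, the count matrix is toy data, and the sketch assumes complete counts (no `NA`s) and \(e_X/k_X = c\). The per-time-point variance uses R's `var()`, i.e. the usual sample variance with squared deviations and divisor \(m_X - 1\).

```r
# Estimated variance of N_hat_X from a k x m matrix of counts
# (rows: counted time points, columns: counters).
# Assumes no missing counts and e_X / k_X = c_min.
var_N_hat_X <- function(Y, c_min = 5) {
  m <- ncol(Y)
  var_t <- apply(Y, 1, var)   # sample variance across counters per time point
  c_min^2 * sum(var_t / m)    # variance of each mean is var_t / m
}

# Standard error of the two on-the-spot estimator;
# m_int is the number of interviewed persons at point B
se_N_hat <- function(var_A, var_B, N_B, phi_hat, m_int) {
  sqrt(var_A + (1 - phi_hat)^2 * var_B +
         N_B^2 * phi_hat * (1 - phi_hat) / m_int)
}

# Toy data: two time points, three counters (not the Hong Kong data)
Y_toy <- rbind(c(1, 2, 3), c(2, 4, 6))
var_N_hat_X(Y_toy)
```

For the toy matrix the row variances are 1 and 4, so the result is \(25 \cdot (1/3 + 4/3) = 125/3\).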

A two-sided \((1-\alpha)\cdot 100\%\) CI is then constructed as \(\hat{N} \pm z_{1-\alpha/2} \operatorname{se}(\hat{N})\), where \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution. To get a 95% CI the value is \(z_{1-0.05/2}=1.96\). Since \(N\) is expected to be at least of moderate size before one bothers counting, this asymptotic CI should have decent coverage.
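In R the interval construction is a one-liner once the estimate and its standard error are available. The values of `N_hat` and `se` below are hypothetical placeholders, not results from the Hong Kong data.

```r
conf_level <- 0.95
alpha <- 1 - conf_level

# Quantile of the standard normal distribution, approximately 1.96
z <- qnorm(1 - alpha / 2)

# Hypothetical estimate and standard error, for illustration only
N_hat <- 26000
se <- 400

# Two-sided asymptotic confidence interval
ci <- N_hat + c(-1, 1) * z * se
names(ci) <- c("ci_lower", "ci_upper")
ci
```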

### Implementation in R

The above equations have been implemented as function
`two_on_the_spot_N` in R, which given a `counts`
data.frame computes the estimate and a corresponding confidence
interval. The github
code of this post contains the details.

`args(two_on_the_spot_N)`

```
## function (counts, ccol_regexp = "^Y[0-9]+", phi_estim, c = 5,
## conf.level = 0.95)
## NULL
```

Among 480 interviewed persons at point B, 437 reported to also have passed point A. In other words \(\hat{\phi}\)=91% and we obtain \(\hat{N}\) as follows with R:

```
## Compute the two on-the-spot estimate based on the data in counts
N <- two_on_the_spot_N(counts, phi_estim=c(437,480), conf.level=0.95)
N
## Rounded version
with(N, round(c(estimate=estimate,ci)/100)*100)
```

```
## estimate ci_lower ci_upper
## 26400 25600 27200
```

Our estimate for the number of participants is thus around 26400 with a 95% confidence interval of 25600-27200. For comparison the authors state that the Hong Kong Police’s estimate was around 28000, whereas the organizers claimed a size of 58000.
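As a sanity check, the point estimate can be recomputed by hand from the sums of the mean counts in the descriptive table and the interview result. This sketch assumes \(e_X/k_X = c = 5\) at both points; the variable names are chosen for illustration.

```r
# Sums of the mean counts from the descriptive table
sum_mean_A <- 4849.50
sum_mean_B <- 4746.67
c_min <- 5                 # assumed spacing between counting blocks (minutes)

# Single-point estimates
N_A <- c_min * sum_mean_A
N_B <- c_min * sum_mean_B

# Fraction of interviewed persons at B who also passed A
phi_hat <- 437 / 480

# Two on-the-spot estimate, rounded to the nearest hundred
N_hat <- N_A + (1 - phi_hat) * N_B
round(N_hat / 100) * 100
# 26400
```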

## Discussion

We were able to reproduce the results of Yip et al. (2010) using the article’s data (up to some rounding issues). An R function is now available for supporting mobile crowd estimation in the future. It will be interesting to synthesize the traditional and easy-to-implement counting approach described above with more modern data sources such as mobile phone or Twitter data (Botta, Moat, and Preis 2015).

Picture source: Ding Yuin Shan, Hong Kong, licensed under the Creative
Commons Attribution 2.0 Generic license.

QED.

## References

Botta, Moat, and Preis (2015). *Royal Society Open Science* 2 (5). https://doi.org/10.1098/rsos.150162.

Watson and Yip (2011). *Significance* 8 (3): 104–7. https://doi.org/10.1111/j.1740-9713.2011.00502.x.

Yip et al. (2010). *Australian & New Zealand Journal of Statistics* 52 (1): 17–26. https://doi.org/10.1111/j.1467-842X.2009.00562.x.