From the table size
\(s_{Q_{i},\textit{table}}\) of each query
\(Q_{i}\) and the scan rate
\(r_{\textit{scan}}\), we can calculate the scan time as
$$t_{Q_{i},\textit{scan}}=\frac{s_{Q_{i},\textit{table}}}{r_{\textit{scan}}}.$$
(1)
The reconfiguration time
\(t_{r}\) depends on the size of the PRs of our RPU. With a constant data rate
\(r_{\textit{acc}}\) of the accelerators, we can calculate the processing time of the first accelerator in the standard plan
S as
$$t_{S,Q_{0},\textit{acc}_{0}}=\frac{s_{Q_{0},\textit{table}}}{r_{\textit{acc}}}.$$
(2)
According to our assumptions, all our accelerators are filters, so they reduce the size of the table by a selectivity factor of
\(f_{Q_{i},\textit{acc}_{j}}\). This leads to an intermediate table of size
$$s_{S,Q_{0},\textit{intermediate}}=s_{Q_{0},\textit{table}}\times f_{Q_{0},\textit{acc}_{0}}$$
(3)
and a final result size of
$$s_{S,Q_{0},\textit{result}}=s_{S,Q_{0},\textit{intermediate}}\times f_{Q_{0},\textit{acc}_{1}}.$$
(4)
For query
\(Q_{1}\) with only one accelerator in use, we get
$$s_{Q_{1},\textit{result}}=s_{Q_{1},\textit{table}}\times f_{Q_{1},\textit{acc}_{0}}.$$
(5)
This will not change in the other plans. The results of both queries are transferred to the host at a constant network data rate
\(r_{\textit{network}}\), resulting in the transfer times:
$$t_{S,Q_{0},\textit{trans}}=\frac{s_{S,Q_{0},\textit{result}}}{r_{\textit{network}}}\quad t_{Q_{1},\textit{trans}}=\frac{s_{Q_{1},\textit{result}}}{r_{\textit{network}}}.$$
(6)
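To make the model concrete, the quantities in Eqns. (1)-(6) can be sketched in code. All numeric values below are hypothetical placeholders chosen only for illustration; they are not measurements from our prototype:

```python
# Hypothetical parameters (illustrative only): sizes in MB, rates in MB/s.
s_table_q0 = 2400.0      # s_{Q0,table}
s_table_q1 = 1200.0      # s_{Q1,table}
r_scan = 1600.0          # scan rate r_scan
r_acc = 1544.0           # accelerator rate r_acc
r_network = 1000.0       # network rate r_network
f_q0_acc0, f_q0_acc1 = 0.1, 0.5   # selectivity factors of Q0's accelerators
f_q1_acc0 = 0.1                   # selectivity factor of Q1's accelerator

# Eqn. (1): scan time for Q0
t_q0_scan = s_table_q0 / r_scan
# Eqn. (2): processing time of the first accelerator for Q0
t_q0_acc0 = s_table_q0 / r_acc
# Eqns. (3)-(4): intermediate and final result sizes for Q0
s_q0_intermediate = s_table_q0 * f_q0_acc0
s_q0_result = s_q0_intermediate * f_q0_acc1
# Eqn. (5): result size for Q1 (single accelerator)
s_q1_result = s_table_q1 * f_q1_acc0
# Eqn. (6): transfer times to the host
t_q0_trans = s_q0_result / r_network
t_q1_trans = s_q1_result / r_network
```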
Overall, we get an execution time for the standard plan
S of:
$$\begin{aligned}t_{S,Q_{0}} & =\max(t_{r},t_{Q_{0},\textit{scan}})+t_{S,Q_{0},\textit{acc}_{0}}+t_{r}+t_{S,Q_{0},\textit{acc}_{1}}+t_{S,Q_{0},\textit{trans}}\end{aligned}$$
(7)
$$\begin{aligned}t_{S,Q_{1}} & =\max(t_{r},t_{Q_{1},\textit{scan}})+t_{Q_{1},\textit{acc}_{0}}+t_{Q_{1},\textit{trans}}\end{aligned}$$
(8)
$$\begin{aligned}t_{S} & =t_{S,Q_{0}}+t_{\textit{gap},Q_{0}}+t_{S,Q_{1}}.\end{aligned}$$
(9)
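The standard-plan timing in Eqns. (7)-(9) then reduces to a few sums and max() operations. A minimal sketch, with hypothetical component times (illustrative only, not measured):

```python
# Hypothetical component times in seconds (illustrative only).
t_r = 0.2                            # reconfiguration time
t_q0_scan, t_q1_scan = 1.5, 0.75     # scan times, Eqn. (1)
t_q0_acc0, t_q0_acc1 = 1.55, 0.16    # accelerator times for Q0
t_q1_acc0 = 0.78                     # accelerator time for Q1
t_q0_trans, t_q1_trans = 0.12, 0.12  # transfer times, Eqn. (6)
t_gap_q0 = 0.05                      # gap after Q0

# Eqn. (7): Q0 under the standard plan S; the first reconfiguration
# overlaps with the table scan, hence the max().
t_s_q0 = max(t_r, t_q0_scan) + t_q0_acc0 + t_r + t_q0_acc1 + t_q0_trans
# Eqn. (8): Q1 under the standard plan S
t_s_q1 = max(t_r, t_q1_scan) + t_q1_acc0 + t_q1_trans
# Eqn. (9): total execution time of plan S
t_s = t_s_q0 + t_gap_q0 + t_s_q1
```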
In strategy
I, we save the reconfiguration time needed for
\(Q_{1}\) and execute the table scan for
\(Q_{1}\) as soon as possible. So the execution time for
\(Q_{1}\) is reduced to
$$t_{I,Q_{1}}=t_{Q_{1},\textit{acc}_{0}}+t_{Q_{1},\textit{trans}}.$$
(10)
However, a larger transfer time
\(t_{I,Q_{0},\textit{trans}}\) is needed for
\(Q_{0}\), as well as additional time
\(t_{I,Q_{0},\textit{DBMS}}\) for DBMS post-processing. So the calculation changes to:
$$t_{I}=\max(\max(t_{r},t_{Q_{0},\textit{scan}})+t_{I,Q_{0},\textit{acc}_{0}}+t_{I,Q_{0},\textit{trans}}+t_{I,Q_{0},\textit{DBMS}}+t_{\textit{gap},Q_{0}},t_{Q_{0},\textit{scan}}+t_{Q_{1},\textit{scan}})+t_{I,Q_{1}}.$$
(11)
The time
\(t_{I,Q_{0},\textit{DBMS}}\) for executing the remaining operations in the DBMS can be derived from the processing rate
\(r_{\textit{dbms}}\) of the DBMS, which we measured on the host.
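Strategy I per Eqns. (10) and (11) can be sketched the same way; again, all component times are hypothetical placeholders, not measurements:

```python
# Hypothetical component times in seconds (illustrative only).
t_r = 0.2
t_q0_scan, t_q1_scan = 1.5, 0.75
t_i_q0_acc0 = 1.55
t_q1_acc0 = 0.78
t_i_q0_trans = 0.24   # larger than in plan S: less-filtered data is shipped
t_i_q0_dbms = 0.3     # DBMS post-processing on the host
t_q1_trans = 0.12
t_gap_q0 = 0.05

# Eqn. (10): reduced execution time for Q1 (no reconfiguration, scan hidden)
t_i_q1 = t_q1_acc0 + t_q1_trans
# Eqn. (11): total execution time of strategy I; the inner max() overlaps
# Q0's pipeline with the back-to-back table scans of Q0 and Q1.
t_i = max(max(t_r, t_q0_scan) + t_i_q0_acc0 + t_i_q0_trans
          + t_i_q0_dbms + t_gap_q0,
          t_q0_scan + t_q1_scan) + t_i_q1
```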
Strategies
II and
III try to hide the reconfiguration time while avoiding the transfer of unnecessary data. The execution time
\(t_{\textit{II}}\) for strategy
II is calculated as:
$$t_{\textit{II}}=\max(\max(t_{r},t_{Q_{0},\textit{scan}})+t_{\textit{II},Q_{0},\textit{acc}_{0}}+t_{r}+t_{\textit{II},Q_{0},\textit{acc}_{1}}+\max(t_{\textit{II},Q_{0},\textit{trans}}+t_{\textit{gap},Q_{0}},t_{r}),t_{Q_{0},\textit{scan}}+t_{Q_{1},\textit{scan}})+t_{\textit{II},Q_{1}}$$
(12)
with
\(t_{\textit{II},Q_{1}}=t_{I,Q_{1}}\). In case of strategy
III, we obtain:
$$t_{\textit{III}}=\max(\max(t_{r},t_{Q_{0},\textit{scan}})+t_{\textit{III},Q_{0},\textit{acc}_{1}}+t_{r}+t_{\textit{III},Q_{0},\textit{acc}_{0}}+t_{\textit{III},Q_{0},\textit{trans}}+t_{\textit{gap},Q_{0}},t_{Q_{0},\textit{scan}}+t_{Q_{1},\textit{scan}})+t_{\textit{III},Q_{1}}$$
(13)
with
\(t_{\textit{III},Q_{1}}=t_{I,Q_{1}}\).
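Strategies II and III (Eqns. (12) and (13)) follow the same pattern. Note that in the model the per-accelerator times are strategy-specific (e.g., \(t_{\textit{III},Q_{0},\textit{acc}_{0}}\)); for simplicity this sketch reuses one hypothetical value per accelerator, and all numbers are illustrative only:

```python
# Hypothetical component times in seconds (illustrative only).
t_r = 0.2
t_q0_scan, t_q1_scan = 1.5, 0.75
t_q0_acc0, t_q0_acc1 = 1.55, 0.16
t_q0_trans = 0.12
t_q1_acc0, t_q1_trans = 0.78, 0.12
t_gap_q0 = 0.05

# Eqn. (10): both strategies reuse t_{I,Q1} for Q1.
t_q1_fast = t_q1_acc0 + t_q1_trans

# Eqn. (12): strategy II also overlaps the final reconfiguration
# with Q0's result transfer (the innermost max()).
t_ii = max(max(t_r, t_q0_scan) + t_q0_acc0 + t_r + t_q0_acc1
           + max(t_q0_trans + t_gap_q0, t_r),
           t_q0_scan + t_q1_scan) + t_q1_fast
# Eqn. (13): strategy III runs Q0's accelerators in swapped order.
t_iii = max(max(t_r, t_q0_scan) + t_q0_acc1 + t_r + t_q0_acc0
            + t_q0_trans + t_gap_q0,
            t_q0_scan + t_q1_scan) + t_q1_fast
```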
To validate our evaluation model, we ran three queries (
\(Q_{0},Q_{1},Q_{2}\)) on our FPGA-based prototype.
\(Q_{0}\) has a
WHERE
clause on two columns and a selectivity factor of
\(0.1\).
\(Q_{1}\) has a
WHERE
clause on a single column, also with a selectivity factor of
\(0.1\).
\(Q_{2}\) likewise has a
WHERE
clause on a single column, but with a selectivity factor of
\(0.2\). Our prototype has a measured accelerator rate
\(r_{\textit{acc}}\) of
\(\approx\)1,544 MB/s. The results are shown in Table
1.
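The model column of Table 1 follows directly from Eqn. (2) with the measured rate \(r_{\textit{acc}}\approx\) 1,544 MB/s. A quick check (the helper name is ours, for illustration):

```python
R_ACC_MB_S = 1544.0  # measured accelerator rate in MB/s

def t_acc_model_ms(s_bytes: float) -> float:
    """Modeled accelerator time per Eqn. (2), in milliseconds."""
    s_mb = s_bytes / 1_000_000
    return s_mb / R_ACC_MB_S * 1000

print(round(t_acc_model_ms(1_200_000), 3))  # 0.777, matching Table 1
print(round(t_acc_model_ms(2_400_000), 3))  # 1.554, matching Table 1
```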
This evaluation model closely matches what actually happens in our system. Our hardware accelerators are built to process the incoming data at line rate, and each accelerator must process all incoming data; hence, processing a dataset of size
\(s_{Q_{i},\textit{table}}\) takes the time given in Eqn.
2. Moreover, the cost of predicate evaluation by the accelerator is constant in terms of time and throughput, as explained in Section
2. All in all, our proposed evaluation model forms the basis for cost-based plan enumeration, which is not in the scope of this paper but part of our immediate future work.
Table 1
Comparison of measured and modeled accelerator times
| table size (\(s\)) [bytes] | Query | \(f\) | \(t_{\textit{acc}}\) measured [ms] | \(t_{\textit{acc}}\) model (\(s/r_{\textit{acc}}\)) [ms] |
| 1,200,000 | \(Q_{0}\) | 0.1 | 0.781 | 0.777 |
| 1,200,000 | \(Q_{1}\) | 0.1 | 0.781 | 0.777 |
| 1,200,000 | \(Q_{2}\) | 0.2 | 0.782 | 0.777 |
| 2,400,000 | \(Q_{0}\) | 0.1 | 1.542 | 1.554 |
| 2,400,000 | \(Q_{1}\) | 0.1 | 1.537 | 1.554 |
| 2,400,000 | \(Q_{2}\) | 0.2 | 1.537 | 1.554 |