Appendices
In these appendices, we discuss the implementation of the Probabilistic Finite State Machines (PFSMs) (“Appendix A”), describe the data sets used (“Appendix B”) and present the derivations for training and inference in our model (“Appendix C and D”, respectively). Moreover, we discuss the outputs of the PADS library (“Appendix E”), present additional experimental results (“Appendix F”) and report on the scalability of the methods (“Appendix G”).
PFSMs for data types
In this work, we use five regular data types, namely integers, strings, floats, Booleans and dates, and two noisy data types, namely missing and anomaly.
Integers
Please see Sect. 2.1 for a detailed discussion of the PFSM used to represent integers.
Floats
A floating-point number usually consists of digits and a full stop character, followed by another set of digits. However, floats can also be written without any fractional component, i.e., as integer numbers. We also support the scientific notation of floating-point numbers with e or E. Lastly, we support the use of a comma as the thousands separator, as in 1,233.15, 1,389,233.15, etc.
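To make these accepted forms concrete, the regular expression below gives an approximate, illustrative characterization of the strings recognized by the float machine; it is only a sketch, as the actual machine is a PFSM with learned probabilities rather than a regular expression.

```python
import re

# Illustrative only: an approximate regular-expression view of the strings the
# float PFSM accepts (optional sign, digits with optional comma thousands
# separators, optional fractional part, optional exponent with e or E).
FLOAT_LIKE = re.compile(
    r"^[+-]?"                    # optional sign
    r"(\d{1,3}(,\d{3})+|\d+)"    # integer part, possibly with thousands commas
    r"(\.\d+)?"                  # optional fractional part
    r"([eE][+-]?\d+)?$"          # optional exponent
)

for s in ["13", "3.14", "1,233.15", "1,389,233.15", "6.02E23"]:
    assert FLOAT_LIKE.match(s) is not None
```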
Strings
The string PFSM is constructed with one initial state and one final state. Through each transition, either a digit, an alpha character, or a punctuation character is emitted. The punctuation characters considered here are ., ,, -, _, %, : and ;, which are commonly found in real-world data sets to represent columns with the string type.
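As an illustration of this construction, the following minimal Python sketch builds a toy version of the string machine with assumed (uniform) transition probabilities; the actual machine's parameters are learned during training, and the function names here are purely illustrative.

```python
import string

# A toy version of the string machine with one initial state q0 and one final
# state qf; every transition emits a digit, a letter, or one of the punctuation
# characters . , - _ % : ;  Probabilities here are assumed uniform, whereas the
# actual machine's parameters are learned.
alphabet = string.digits + string.ascii_letters + ".,-_%:;"

I = {"q0": 1.0, "qf": 0.0}          # initial-state probabilities
F = {"q0": 0.0, "qf": 1.0}          # final-state probabilities
T = {("q0", a, "q0"): 0.9 / len(alphabet) for a in alphabet}        # keep emitting
T.update({("q0", a, "qf"): 0.1 / len(alphabet) for a in alphabet})  # emit and stop

def probability(word):
    """p(word | string machine): the only accepting path loops in q0 and moves
    to qf while emitting the last character."""
    if not word:
        return 0.0
    p = I["q0"]
    for ch in word[:-1]:
        p *= T.get(("q0", ch, "q0"), 0.0)
    return p * T.get(("q0", word[-1], "qf"), 0.0) * F["qf"]

print(probability("hello_123"))   # non-zero
print(probability("café"))        # zero: é is not in the alphabet
```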
Booleans
Our machine supports the following values by assigning them non-zero probabilities: Yes, No, True, False, 1, 0, -1 and their variants yes, Y, y, no, true, false.
Dates
We categorize date formats into two groups, which are detailed below:
ISO-8601
We support values in the YYYY-MM-DDTHH:MM:SS format, where T is the time designator indicating the start of the time-of-day component. We also support other ISO-8601 formats such as YYYYMMDD, YYYY-MM-DD, HH:MM and HH:MM:SS.
Nonstandard formats
We treat years in YYYY format as the date type. To distinguish years from integers, we restrict this format to the range [1000-2999]. On the other hand, we do not explicitly constrain the month (MM) and day (DD) columns to valid ranges, but treat them as integers. We also support ranges of years written with various separators, such as YYYY-YYYY and YYYYYYYY. Lastly, we support dates written as MM-DD-YYYY HH:MM:SS AM/PM and month names, e.g., January, February, etc.
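For illustration, the sketch below lists the ISO-8601 and nonstandard formats above as strptime patterns. This is only an approximation of the date PFSM (which, for example, restricts bare years to the range [1000-2999]), and the helper name is hypothetical.

```python
from datetime import datetime

# Illustrative strptime patterns for the date formats listed above; this is
# only an approximation of the date PFSM, not the machine itself.
PATTERNS = [
    "%Y-%m-%dT%H:%M:%S",       # ISO-8601 date and time
    "%Y%m%d", "%Y-%m-%d",      # ISO-8601 dates
    "%H:%M", "%H:%M:%S",       # ISO-8601 times
    "%Y",                      # bare years (the PFSM restricts these to 1000-2999)
    "%m-%d-%Y %I:%M:%S %p",    # MM-DD-YYYY HH:MM:SS AM/PM
    "%B",                      # month names such as January
]

def looks_like_date(value):
    for fmt in PATTERNS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

print(looks_like_date("2016-09-30T12:00:00"), looks_like_date("hello"))
```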
Missing
The machine for missing data assigns non-zero probabilities to a fixed set of values commonly used to denote missing entries, including Null, NA and their variants such as NULL, null, “NA ”, NA, “N A”, N/A, “N/ A”, “N /A”, N/A, #NA, #N/A, na, “ na”, “na ”, “n a”, n/a, N/O, NAN, NaN, nan, -NaN and -nan; special characters such as -, !, ?, * and .; integers such as 0, -1, -9, -99, -999, -9999 and -99999; and characters denoting empty cells such as “” and “ ”.
Anomaly
We use all of the Unicode characters in this machine’s alphabet, including the accented characters. Note that the number of elements in this set is 1,114,112.
Table 4 Information about the data sets used
Data sets
We share the available data sets and the corresponding annotations at https://goo.gl/v298ER. Here, we briefly describe these data sets and provide a list in Table 4, which denotes their sources and whether they are used in the training or testing phase.
- Accident 2016: information on accident casualties across Calderdale, including location, number of people and vehicles involved, road surface, weather conditions and severity of any casualties.
- Accidents 2015: a file from Road Safety data about the circumstances of personal injury road accidents in GB from 1979.
- Adult: a data set extracted from the U.S. Census Bureau database to predict whether income exceeds $50K/yr.
- Auto: a data set consisting of various characteristics of a car, its assigned insurance risk rating and its normalized losses in use.
- Broadband: annual survey of consumer broadband speeds in the UK.
- Billboard: a data set on weekly Hot 100 singles, where each row represents a song and the corresponding position on that week’s chart.
- Boston Housing: a data set which contains census tracts of Boston from the 1970 census.
- BRFSS: a subset of the 2009 survey from BRFSS, an ongoing data collection program designed to measure behavioral risk factors for the adult population.
- Canberra Observations: weather and climate data of Canberra (Australia) in 2013.
- Casualties 2015: a file from Road Safety data about the consequential casualties.
- Census Income KDD: a data set that contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau.
- CleanEHR (Critical Care Health Informatics Collaborative): anonymised medical records.
- Cylinder Bands: a data set used in decision tree induction for mitigating process delays known as “cylinder bands” in rotogravure printing.
- data.gov: 9 CSV files obtained from data.gov, presenting information such as the age-adjusted death rates in the U.S., Average Daily Traffic counts, Web traffic statistics, the current mobile licensed food vendors statistics in the City of Hartford, a history of all exhibitions held at San Francisco International Airport by SFO Museum, etc.
- EDF Stocks: EDF stock prices from 23/01/2017 to 10/02/2017.
- El Niño: a data set containing oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.
- FACA Member List 2015: data on Federal Advisory Committee Act (FACA) Committee Member Lists.
- French Fries: a data set collected from a sensory experiment conducted at Iowa State University in 2004 to investigate the effect of using three different fryer oils on the taste of the fries.
- Fuel: Fuel Economy Guide databases for 1985-1993 models.
- Geoplaces2: information about restaurants (from UCI ML Restaurant & consumer data).
- HES (Household Electricity Survey): time series measurements of the electricity use of domestic appliances (to gain access to the data, please register at https://tinyurl.com/ybbqu3n3).
- Housing Price: a data set containing 79 explanatory variables that describe (almost) every aspect of residential homes in Ames, Iowa.
- Inspection Outcomes: local authority children’s homes in England - inspection and outcomes as at 30 September 2016.
- Intel Lab: a data set collected from sensors deployed in the Intel Berkeley Research lab, measuring timestamped topology information, along with humidity, temperature, light and voltage.
- mass.gov: 4 CSV files obtained from mass.gov, which is the official website of the Commonwealth of Massachusetts.
- MINY Vendors: information on “made in New York” Vendors.
- National Characteristics: information on the overall, authorised, unauthorised and persistent absence rates by pupil characteristics.
- One Plus Sessions: information on the number of enrollments with one or more session of absence, including by reason for absence.
- Pedestrian: a count data set collected in 2016 that denotes the number of pedestrians passing within an hour.
- PHM Collection: information on the collection of the Powerhouse Museum Sydney, including textual descriptions, physical, temporal and spatial data as well as, where possible, thumbnail images.
- Processed Cleveland: a data set concerning heart disease diagnosis, collected at the Cleveland Clinic Foundation (from the UCI ML Heart Disease Data Set).
- Sandy Related: Hurricane Sandy-related NYC 311 calls.
- Reported Taser 2015: a hand-compiled raw data set based on forms filled out by officers after a stun gun was used in an incident, provided by CCSU’s Institute for Municipal and Regional Policy.
- Rodents: information collected on rodents during a survey.
- Survey: a data set from a 2014 survey that measures attitudes towards mental health and the frequency of mental health disorders in the tech workplace.
- TAO: real-time data collected by the Tropical Atmosphere Ocean (TAO) project from moored ocean buoys for improved detection, understanding and prediction of El Niño and La Niña.
- Tb: a tuberculosis data set collected by the World Health Organisation which records the counts of confirmed tuberculosis cases by “country”, “year” and demographic group.
- Tundra Traits: measurements of the physical characteristics of shrubs in the Arctic tundra.
- T2Dv2 Gold Standard: a set of Web tables used to evaluate matching systems on the task of matching Web tables to the DBpedia knowledge base.
- User Profile: information about consumers (from UCI ML Restaurant & consumer data).
- Vehicles 2015: a file from Road Safety data about the types of vehicles involved in the accidents.
- 83492acc-1aa2-4e80-ad05-28741e06e530: a hypoparsr data set which contains information on expenses.
Note that the data sets from mass.gov and data.gov are obtained from Abdulhakim A. Qahtan and also used in [16].
Derivations for the training
The task is to update the parameters of the PFSMs, given a set of columns X and their column types \(\mathbf{t} \). Since the columns are assumed to be independent, the gradient can be calculated by summing the gradient of each column. In the interest of simplicity, here we only derive the gradient for a column \(\mathbf{x} \) of type k. We would like to maximize the posterior probability of the correct type k given a data column \(\mathbf{x} \), which can be rewritten as follows:
$$\begin{aligned} \log p(t=k|\mathbf{x} )&= \underbrace{\log p(t=k,\mathbf{x} )}_{L_c} - \underbrace{\log p(\mathbf{x} )}_{L_f}. \end{aligned}$$
(3)
We now present the derivations of the gradients w.r.t. the transition parameters where \(\theta ^\tau _{q,\alpha ,q'}\) denotes the transition parameter from state q to \(q'\) emitting the symbol \(\alpha \) in the \(\tau \)th PFSM. Note that \(\tau \in \{1,\dots ,K\}\) where K is the number of PFSMs.
We now derive these two terms in turn, in the following two subsections.
Derivative of \(L_c\)
$$\begin{aligned} \frac{\partial L_c}{\partial \theta _{q,\alpha ,q'}^\tau }&= \frac{\partial \log p(t=k,\mathbf {x})}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \frac{\partial \Big ( \log p(t=k) \prod _{i=1}^N p(x_i|t=k)\Big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \frac{\partial \Big ( \log p(t=k) + \sum _{i=1}^N \log p(x_i|t=k)\Big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{\partial \log p(x_i|t=k)}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i|t=k)} \frac{\partial p(x_i|t=k)}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i|t=k)} \frac{\partial \big (\sum _{z'} p(z_i=z',x_i|t=k) \big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i|t=k)} \frac{\sum _{z'} \partial p(z_i=z',x_i|t=k)}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i|t=k)} \frac{\sum _{z'} \partial \Big (p(z_i=z'|t=k) p(x_i|z_i=z')\Big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i|t=k)} \frac{\partial \Big (\pi _k^k p(x_i|z_i=k) + \pi _k^m p(x_i|z_i=m) + \pi _k^a p(x_i|z_i=a) \Big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i|t=k)} \frac{\pi _k^k \partial p(x_i|z_i=k)}{\partial \theta _{q,\alpha ,q'}^\tau }. \end{aligned}$$
(4)
When \(\tau \) is not equal to k, the derivative in Eq. (4) becomes 0. On the other hand, if \(\tau =k\), then we need to calculate \(\frac{\partial p(x_i|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau }\), where \(p(x_i|z_i=\tau )\) can be rewritten as \(\sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau )\), as \(x_i\) is generated by a PFSM. Note that \(q_{0:L}\) denotes the states visited to generate \(x_i\). The derivative can then be calculated as follows:
$$\begin{aligned} \frac{ \partial p(x_i|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau }&= \frac{ \partial \Big ( \sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau ) \Big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{q_{0:L}} \frac{\partial p(x_i,q_{0:L}|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau ) \frac{\partial \log p(x_i,q_{0:L}|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau ) \frac{\partial \log \Big [ I^{\tau }(q_0) \Big ( \prod _{l=0}^{L-1} T^{\tau }(q_l, x_i^l, q_{l+1}) \Big ) F^{\tau }(q_L)\Big ]}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau ) \frac{\partial \sum _{l=0}^{L-1} \Big ( \log T^{\tau }(q_l, x_i^l, q_{l+1}) \Big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau ) \sum _{l=0}^{L-1} \frac{\partial \log T^{\tau }(q_l, x_i^{l}, q_{l+1}) }{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau ) \sum _{l=0}^{L-1} \frac{\delta (q_l,q) \delta (x_i^l, \alpha ) \delta (q_{l+1}, q')}{T^{\tau }(q_l, x_i^l, q_{l+1})}, \nonumber \\&= \sum _{q_{0:L}} \sum _{l=0}^{L-1} p(x_i,q_{0:L}|z_i=\tau ) \Big (\frac{\delta (q_l,q) \delta (x_i^l, \alpha ) \delta (q_{l+1}, q')}{T^{\tau }(q_l, x_i^l, q_{l+1})}\Big ), \nonumber \\&= \sum _{q_{0:L}} \sum _{l=0}^{L-1} p(q_l=q, q_{l+1}=q', q_{0:L \setminus l,l+1}, x_i|z_i=\tau ) \Big (\frac{\delta (x_i^l, \alpha )}{T^{\tau }(q, x_i^l, q')}\Big ), \nonumber \\&= \sum _{l=0}^{L-1} \sum _{q_{0:L}} p(q_l=q, q_{l+1}=q', q_{0:L \setminus l,l+1}, x_i|z_i=\tau ) \Big (\frac{\delta (x_i^l, \alpha )}{T^{\tau }(q, x_i^l, q')}\Big ), \nonumber \\&= \sum _{l=0}^{L-1} \frac{\delta (x_i^l, \alpha ) p(q_l=q, q_{l+1}=q', x_i|z_i=\tau ) }{T^{\tau }(q, x_i^l, q')}. \end{aligned}$$
(5)
Hence, we need to evaluate the joint probability \(p(q_l=q, q_{l+1}=q', x_i|z_i=\tau )\) for each l where \(x_i^l=\alpha \), which can be found by marginalizing out the variables \(q_{0:L \setminus \{l,l+1\}}\):
$$\begin{aligned} p(q_l=q, q_{l+1}=q', x_i|z_i=\tau )&= \sum _{q_{l'}} p(q_l=q, q_{l+1}=q', q_{l'}, x_i|z_i=\tau ), \end{aligned}$$
(6)
where \(l'\) denotes \(\{0:L\} \setminus \{l,l+1\}\). This probability can be calculated efficiently via the Forward-Backward algorithm, where the forward and backward messages are defined recursively as follows:
$$\begin{aligned} v_{l \rightarrow l+1}(q_l)&= \sum _{q_{l-1}} T^\tau (q_{l-1}, x_i^l, q_l) v_{l-1 \rightarrow l}(q_{l-1}), \nonumber \\ \lambda _{l+1 \rightarrow l}(q_{l+1})&= \sum _{q_{l+2}} T^\tau (q_{l+1}, x_i^{l+2}, q_{l+2}) \lambda _{l+2 \rightarrow l+1}(q_{l+2}), \end{aligned}$$
(7)
We can then rewrite \(p(q_l=q, q_{l+1}=q', x_i|z_i=\tau )\) as follows:
$$\begin{aligned} p(q_l, q_{l+1}, x_i|z_i=\tau )&= (v_{l \rightarrow l+1}(q_{l}) \bullet \lambda _{l+1 \rightarrow l}(q_{l+1})) \odot T^\tau (q_l, x_i^{l+1}, q_{l+1}), \end{aligned}$$
(8)
where \(\bullet \) and \(\odot \) denote the outer and element-wise products, respectively.
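For concreteness, the following NumPy sketch implements the recursions in Eqs. (7) and (8), assuming the machine is stored as dense arrays I, T and F and using the zero-based symbol indexing of Eq. (5), i.e., the transition from \(q_l\) to \(q_{l+1}\) emits \(x_i^l\); the function and argument names are illustrative only.

```python
import numpy as np

def forward_backward(x, I, T, F, sym2id):
    """Forward-backward messages for one PFSM, stored as dense arrays:
    I[q] initial, F[q] final and T[q, a, q'] transition probabilities.
    Symbols are mapped to integer ids via sym2id; the transition from
    q_l to q_{l+1} emits x^l (zero-based indexing, as in Eq. (5))."""
    L, S = len(x), len(I)
    ids = [sym2id[c] for c in x]

    alpha = np.zeros((L + 1, S))     # alpha[l, q] = p(x^{0:l-1}, q_l = q)
    alpha[0] = I
    for l in range(L):
        alpha[l + 1] = alpha[l] @ T[:, ids[l], :]

    beta = np.zeros((L + 1, S))      # beta[l, q] = p(x^{l:L-1}, stop | q_l = q)
    beta[L] = F
    for l in reversed(range(L)):
        beta[l] = T[:, ids[l], :] @ beta[l + 1]

    # Pairwise marginals as in Eq. (8): xi[l, q, q'] = p(q_l = q, q_{l+1} = q', x)
    xi = np.stack([np.outer(alpha[l], beta[l + 1]) * T[:, ids[l], :]
                   for l in range(L)])
    return alpha, beta, xi

# p(x | z = tau) can be read off as I @ beta[0] (equivalently alpha[L] @ F).
```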
Derivative of \(L_f\)
Let us now take the derivative of the second term \(L_f\):
$$\begin{aligned} \frac{\partial L_f}{\partial \theta _{q,\alpha ,q'}^\tau }&= \frac{\partial \log p(x)}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \frac{\partial \sum _{i=1}^N \log p(x_i)}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{\partial \log p(x_i)}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i)} \frac{\partial \big (\sum _{t'} \sum _{z'} p(t=t',z_i=z',x_i) \big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i)} \frac{\partial \big (\sum _{z'} p(t=\tau ,z_i=z',x_i) \big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i)} \frac{\partial \big (\sum _{z'} p(t=\tau ) p(z_i=z'|t=\tau ) p(x_i|z_i=z') \big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i)} \frac{p(t=\tau ) \partial \big (\pi _\tau ^\tau p(x_i|z_i=\tau ) + \pi _\tau ^m p(x_i|z_i=m) + \pi _\tau ^a p(x_i|z_i=a) \big )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{1}{p(x_i)} \frac{p(t=\tau ) \pi _\tau ^\tau \partial p(x_i|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau }, \nonumber \\&= \sum _{i=1}^N \frac{p(t=\tau ) \pi _\tau ^\tau }{p(x_i)} \frac{ \partial p(x_i|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau }. \end{aligned}$$
(9)
Let us now put all the equations together. When we calculate the derivative of the log posterior in Eq. (3) w.r.t. the correct machine, i.e., \(\tau =k\), the derivative becomes the following:
$$\begin{aligned} \frac{\partial \log p(t=k|x)}{\partial \theta ^\tau _{q,\alpha ,q'}}&{=} \sum _{i=1}^N \Big ( \frac{\pi _k^k}{p(x_i|t=k)} \frac{\partial p(x_i|z_i=k)}{\partial \theta _{q,\alpha ,q'}^k} {-} \frac{\pi _k^k p(t=k)}{p(x_i)} \frac{ \partial p(x_i|z_i=k)}{\partial \theta _{q,\alpha ,q'}^k} \Big ), \nonumber \\&= \sum _{i=1}^N \Big ( \pi _k^k \frac{\partial p(x_i|z_i=k)}{\partial \theta _{q,\alpha ,q'}^k} \Big ( \frac{1}{p(x_i|t=k)} - \frac{p(t=k)}{p(x_i)}\Big ) \Big ), \nonumber \\&= \sum _{i=1}^N \frac{\pi _k^k}{p(x_i|t=k)} \frac{\partial p(x_i|z_i=k)}{\partial \theta _{q,\alpha ,q'}^k} \Big ( 1 - \frac{p(t=k) p(x_i|t=k)}{p(x_i)}\Big ), \nonumber \\&= \sum _{i=1}^N \frac{\pi _k^k}{p(x_i|t=k)} \frac{\partial p(x_i|z_i=k)}{\partial \theta _{q,\alpha ,q'}^k} \Big ( 1{-} \frac{p(t=k,x_i)}{\sum _{k'} p(t=k',x_i)}\Big ). \end{aligned}$$
(10)
When we calculate the derivative of the log posterior in Eq. (3) w.r.t. the wrong machines, i.e., \(\tau \ne k\), this becomes:
$$\begin{aligned} \frac{\partial \log p(t=k|x)}{\partial \theta ^\tau _{q,\alpha ,q'}}&= - \sum _{i=1}^N \Big ( \frac{\pi _\tau ^\tau p(t=\tau )}{p(x_i)} \frac{ \partial p(x_i|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau } \Big ). \end{aligned}$$
(11)
Lastly, we ensure that the parameters remain positive and normalized by using the softmax function. We define \(T_\tau (q,\alpha ,q') = \exp {T_\tau ^z(q,\alpha ,q')}/(\exp {F_\tau ^z(q)} + \sum _{\alpha ',q''} \exp T_\tau ^z(q,\alpha ',q'') )\) and \(I_\tau (q) = \exp {I_\tau ^z(q)}/\sum _{q'} \exp {I_\tau ^z(q')}\). We then update these unconstrained parameters using the gradient obtained via the chain rule: \(\partial f/\partial T_\tau ^z(q,\alpha ,q') = (\partial f/ \partial T_\tau (q,\alpha ,q')) (\partial T_\tau (q,\alpha ,q') /\partial T_\tau ^z(q,\alpha ,q'))\).
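The sketch below illustrates this reparameterization under assumed array shapes; following the per-state normalizer above, the final-state probabilities are taken to share the same denominator, so that \(F_\tau (q) + \sum _{\alpha ,q'} T_\tau (q,\alpha ,q') = 1\). The names z_T, z_F and z_I are hypothetical, not from our implementation.

```python
import numpy as np

def constrained_params(z_T, z_F, z_I):
    """Map unconstrained parameters to PFSM probabilities (a sketch under
    assumed shapes: z_T is (S, A, S), z_F and z_I are (S,)).  For each state q,
    T(q, ., .) and F(q) share one softmax so that they jointly sum to one."""
    denom = np.exp(z_F) + np.exp(z_T).sum(axis=(1, 2))   # per-state normalizer
    T = np.exp(z_T) / denom[:, None, None]
    F = np.exp(z_F) / denom
    I = np.exp(z_I) / np.exp(z_I).sum()                  # softmax over initial states
    return T, F, I

# Gradient updates are applied to z_T, z_F and z_I; by the chain rule,
# d f / d z_T = (d f / d T) * (d T / d z_T), so no projection step is needed.
```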
Derivations for inference
The posterior distribution of column type t can be derived as follows:
$$\begin{aligned} p(t=k|\mathbf{x} )\propto & {} p(t=k, \mathbf{x} ), \\= & {} p(t=k) \prod _{i=1}^N p(x_i | t=k), \\= & {} p(t=k) \prod _{i=1}^N \Big ( \pi _k^k p(x_i | z_i = k) \\&+ \pi _k^m p(x_i | z_i = m) + \pi _k^a p(x_i | z_i = a) \Big ). \end{aligned}$$
Let us assume that \(t=k\) according to \(p(t|\mathbf{x} )\), the posterior distribution of column type. Then we can write the posterior distribution of row type \(z_i\) given \(t=k\) and \(\mathbf{x} \) as:
$$\begin{aligned} p(z_i=j | t=k, \mathbf{x} )= & {} \frac{p(z_i=j, x_i | t=k)}{p(x_i | t=k)}, \nonumber \\= & {} \frac{p(z_i=j, x_i | t=k) }{\sum _{z_i \in \{k,m,a\}} p(z_i, x_i | t=k)}. \end{aligned}$$
(12)
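Both posteriors are typically computed in log space to avoid numerical underflow. The sketch below shows one possible implementation under assumed array shapes; names such as log_pt, log_pi and log_pz are hypothetical placeholders for the quantities defined above.

```python
import numpy as np
from scipy.special import logsumexp

def column_type_posterior(log_pt, log_pi, log_pz):
    """p(t = k | x) for every candidate type k.  Assumed inputs: log_pt[k] is
    log p(t=k), log_pi[k] is the length-3 vector of log pi_k^j for j in
    {k, m, a}, and log_pz[k] is a (3, N) array of log p(x_i | z_i = j)."""
    K = len(log_pt)
    log_joint = np.empty(K)
    for k in range(K):
        # log p(x_i | t=k) = logsumexp_j [ log pi_k^j + log p(x_i | z_i = j) ]
        log_px = logsumexp(log_pi[k][:, None] + log_pz[k], axis=0)
        log_joint[k] = log_pt[k] + log_px.sum()
    return np.exp(log_joint - logsumexp(log_joint))

def row_type_posterior(k, log_pi, log_pz):
    """Eq. (12): p(z_i = j | t = k, x) for every row i, as a (3, N) array."""
    log_num = log_pi[k][:, None] + log_pz[k]
    return np.exp(log_num - logsumexp(log_num, axis=0, keepdims=True))
```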
Table 5 A sample test dataset
The outputs of the PADS library
We have mentioned previously that the outputs generated by the PADS library do not directly address our problem. We present a sample test dataset in Table 5 and a part of the corresponding output of the PADS library in Fig. 7.
The outputs are interpreted starting from the bottom. In this case, the data is defined as an array of “struct” type named Struct_194, which is further characterized as a combination of various “union” types. For example, let us consider the first one, named Union_19, which consists of a constant string \(\texttt {\&nbsp;}\), another constant string \(\texttt {year}\) and an integer type. However, this can be more complicated, as in the type Union_165, which consists of two struct types, Struct_192 and Struct_164. Note that the former is further divided into a union type, whereas the latter is described as a combination of some constant strings and a float type. As the reader can see, it can become difficult and time-consuming to interpret such an output. Moreover, the output becomes more complex when delimiters are not inferred correctly, as this can prevent the types from being column-specific.
Table 6 Performance of the methods using the Jaccard index and overall accuracy, for the types Date, Logical, Numeric and Text
Additional experimental results
Table 6 presents the comparisons with hypoparsr.
Figure 8 presents the normalized confusion matrices for the methods discussed in the paper.
Scalability of the methods
Table 7 reports the number of rows, columns and unique elements in each test dataset, together with the time taken to infer the column types.
Table 7 Size of the test datasets and the times in seconds it takes to infer column types per column (on average), where U denotes the number of unique data entries in a dataset