Introduction
Avian influenza (AI) is a disease caused by influenza type A viruses. The natural reservoir for avian influenza virus (AIV) is aquatic wild birds (Poetri, 2014); however, AIV can also infect domestic poultry as well as other avian and mammalian species. AIV also sporadically infects humans and is therefore regarded as a zoonotic virus (CDC, 2010).
AIV outbreaks in the commercial poultry industry pose a continuing threat. To contain outbreaks of AIV, various control measures, such as culling, quarantine, isolation, and vaccination, have been applied. Such measures, however, may lead to substantial financial losses, regardless of their effectiveness. A large number of studies have employed mathematical models to gain a better understanding of how AIV outbreaks occur and to determine which factors contribute to AI progression. Modeling methods are also used to select cost-effective strategies for the control of AIV outbreaks.
A mathematical model is a simplified, quantitative representation of a real-world system. Mathematical models can provide a theoretical framework to test real-world scenarios (Siettos and Russo, 2013) or predict the output of complex systems for which performing a real experiment is costly or impossible. Furthermore, computer simulations in conjunction with mathematical models can bring realism to the models and approximate the behavior of real systems. Mathematical models have been used in AI research (Wiratsudakul, 2014; Maseleno et al., 2015) to explore patterns and dynamics of disease spread, to assess the effectiveness of interventions, and to manage containment plans during outbreaks (Dorjee et al., 2013). Mathematical modeling constitutes one step in the process of knowledge discovery in databases (KDD), which refers to the broader process of finding knowledge in data sources. Therefore, this review focuses not only on mathematical models, but also on the entire KDD process. The present report provides a review of AI modeling methods and explains the advantages of novel methods that can be used in KDD processes in this field. In general, the existing modeling methods can be divided into two groups based on the amount of data required to construct them. For each group, the goal of this review is to summarize previous work, to identify existing gaps in the knowledge discovery process, and to describe ways to address these gaps.
Knowledge discovery in databases (KDD)
KDD refers to the overall process of extracting novel and useful patterns from data sources (Fayyad et al., 1996; Williams and Huang, 1996). The primary goal of KDD is to transform data from large databases into new knowledge (Qi, 2008). Currently, data from several sources relevant to AI outbreak detection and containment, such as sensor networks, social media, and satellite technology, are being collected and accumulated at a dramatic pace. For example, sensor networks can be used to monitor AI in poultry farms and satellite images can be used to monitor environmental changes. Such data sources can provide an opportunity to gain the precise and timely knowledge required for AI containment planning. The data, however, are usually streaming, large, and in varying formats, and they require continuous and automatic storage, pre-processing, analysis, interpretation, and evaluation. Therefore, not only data collection and analysis, but the whole chain of KDD is required to guarantee high-quality knowledge discovery for AI-related decision-making.
A general overview of the KDD process is presented in Fig. 1. According to Nakamori (2011), KDD comprises five phases:
(1) Problem definition: Understanding the problem domain is a necessary prerequisite for a relevant knowledge extraction task. In the case of AI, experts from different disciplines, such as epidemiology, environmental science, statistics, and computer science should collaborate on the underlying problem that is being addressed in a data analysis task. They should also determine prior knowledge, potential data sources, requirements, and project objectives. In general, without problem definition, even the most advanced techniques will be incapable of providing the desired results.
(2) Data collection and pre-processing: After obtaining a clear understanding of the problem, domain experts explore the data, create a target dataset from the available data sources, and prepare the data for deriving knowledge. Traditionally, the process of data collection was paper-based (Blumenberg and Barros, 2016) and therefore time-consuming. Recently, due to the high volume of data generated by the internet, digital devices, and computational simulations, epidemiology has been turning into a data-intensive discipline. To cope with this volume, concepts such as ontology (Pesquita et al., 2014) have been introduced to facilitate the integration, validation, searching, and sharing of epidemiological resources. An ontology is a standard description of domain concepts and their relationships (Noy and McGuinness, 2001). A potential application of ontology in the field of AI is to define the relationships between conceptual entities, such as highly pathogenic AI (HPAI), low pathogenic AI (LPAI), AIV subtypes, and outbreak locations. Such an ontology can be used to search AI databases or aggregate data sources.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20200129162817806-0012:S1466252319000033:S1466252319000033_fig1.png?pub-status=live)
Fig. 1. KDD process.
Data pre-processing is a labor-intensive step in the KDD process. This step serves several purposes, including cleaning, quality improvement, and dimensionality reduction of data, in addition to managing volumes of data that are too large to be processed in computer memory. Data cleaning involves several tasks, such as the removal of noise and inconsistent data, managing missing data fields, and discretization of data (Ramírez-Gallego et al., 2016), which are necessary to correct inaccurate data. Record linkage (Dusetzina et al., 2014) is another pre-processing task to improve data quality and integrity. Furthermore, data can be reduced by decreasing the number of data elements without compromising the validity of the data. During the cleaning stage, depending upon the goal of data mining, transformations of data such as normalization, type conversion, aggregation, or smoothing may be required.
The term ‘data instance’ or ‘data example’ describes a single object of a dataset. Instances are described by ‘feature’ vectors. A feature is a specification that defines a property of a data entity. The term feature is sometimes used synonymously with ‘attribute’ or ‘dimension’. Recent trends in data collection have resulted in datasets with enormous numbers of dimensions. This problem is called the ‘curse of dimensionality’ (Bellman, 2013), and appears in datasets with hundreds or thousands of features. The curse of dimensionality increases computation cost, storage requirements, and the time required for analyses.
As a solution, several feature selection (FS) methods have been proposed with the objective of finding a subset of features that are most representative, and then discarding the rest (Alpaydin, 2014; Chi, 2009). The selected subset is a reduced representation of the initial data, meaning that it is much smaller than the initial dataset in size, but produces the same or nearly the same results. FS methods can be divided into three categories: filter, wrapper, and embedded (Neumann et al., 2016). In filter methods, the FS step is a pre-processing step separate from the machine learning (ML) model. This group of methods assesses the properties of the data using scoring metrics such as the chi-square test, mutual information, correlation coefficients, Fisher's discriminant scores, and variance thresholds. Filter methods have been used in AI studies to find the most critical features and investigate which ones are statistically significant (Herrick, 2013; Si et al., 2013; Gilbert et al., 2014; Wang et al., 2017). Although filter methods are computationally efficient and fast, they do not involve any learning, which may affect the classification accuracy (Hira and Gillies, 2015).
Wrapper methods, however, use ML models to measure the quality of candidate subsets of features by searching the feature space. The genetic algorithm (GA) is an example of a search technique used in wrapper FS. GAs evolve populations of candidate solutions to a problem and are inspired by natural evolution and the genetic mechanisms of living things. Although wrapper methods outperform filter methods in terms of accuracy (Neumann et al., 2016), they are computationally expensive and can suffer from over-fitting. Random forest (RF), an ensemble of decision trees that has been used to identify the most significant risk factors for the prediction of AI, is an example of wrapper FS (Herrick et al., 2013). Finally, embedded methods combine filter and wrapper FS methods and offer low cost and high accuracy. Recursive feature elimination is an embedded FS method. Despite the benefits that they offer, embedded FS methods have not been popular in the AI literature.
Feature extraction is another approach to creating a lower-dimensional representation of data. Feature extraction constructs a new set of features by combining original features (Alpaydin, 2014). For example, principal component analysis is a well-known feature extraction method. To the best of our knowledge, feature extraction methods have not been used in AI modeling. In studies aimed at risk factor analysis (Nishiguchi et al., 2007; Busani et al., 2009; Gonzales, 2012; Nguyen, 2013), feature extraction can be used as an approach for creating new covariates.
(3) Data mining (DM): This stage requires selecting a dataset and the appropriate DM algorithms for a specific mining objective. The DM algorithms then discover patterns that exist in the data. ML and statistical methods are examples of the many approaches used in DM. ML approaches can be grouped into supervised, unsupervised, and semi-supervised learning methods. Unsupervised methods find useful patterns in unlabeled data. Clustering is an example of an unsupervised ML task, in which data are mapped into clusters based on similarity metrics or probability density models. By contrast, supervised methods learn a mapping from inputs to outputs using labeled data. Classification is a supervised ML task that learns from labeled data (i.e. training data) and categorizes data into one of several predefined classes. Regression also falls into the supervised learning group and estimates the relationship between covariates and a dependent variable. Since labeled data may be expensive or unavailable in many real-world applications, a third group of ML methods has been introduced (Chapelle et al., 2009). This group, known as semi-supervised learning, is trained on a combination of labeled and unlabeled data.
There have been recent attempts in epidemiological research to use semi-supervised (Zhao et al., 2015), supervised (Erraguntla et al., 2010; Santillana et al., 2015; Valdes-Donoso et al., 2017), and unsupervised (Chen et al., 2016; Ghosh et al., 2017; Lim et al., 2017) learning approaches. In AI research, unsupervised ML algorithms such as K-means are recommended for spatiotemporal profiling, outbreak detection, and surveillance studies.
(4) Post processing: After building one or more models, the next step is to interpret the obtained knowledge from the DM algorithms. The aim is to see whether suitable patterns have been discovered with respect to the goals defined in the first step. In this phase, various visualizations such as boxplots, histograms, time series plots, or two-dimensional scatter plots are used as a part of the evaluation stage.
(5) Practical use: The final goal of KDD is to use the newly obtained knowledge in real-world applications. In other words, the knowledge captured in the process needs to be organized and depicted in a way that a user or a machine can use it. Depending on the goal of a knowledge discovery process, a variety of applications may be built and provided to the user. In a potential AI decision support system, the goal of the KDD process could be: presenting reports, outbreak warnings, outbreak spread monitoring, outbreak prediction, and assessing intervention policies.
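As a concrete illustration of the filter approach to feature selection described under phase (2), the following minimal sketch drops near-constant features with a variance threshold and then ranks the remaining ones by the absolute Pearson correlation with the outcome. The feature names, farm values, and the `filter_select` helper are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
from math import sqrt


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_select(features, target, k, min_variance=1e-9):
    """Score each feature independently of any ML model and keep the top k.

    features: dict mapping feature name -> list of values (one per instance).
    """
    scores = {
        name: abs(pearson(column, target))
        for name, column in features.items()
        if variance(column) > min_variance  # variance-threshold filter
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: six farms, outbreak = 1 if AI was detected.
features = {
    "duck_density": [0.1, 0.4, 0.5, 0.9, 1.2, 1.5],
    "distance_to_water_km": [9.0, 7.5, 8.0, 3.0, 2.0, 1.0],
    "farm_id_parity": [0, 1, 0, 1, 0, 1],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
outbreak = [0, 0, 0, 1, 1, 1]

selected, scores = filter_select(features, outbreak, k=2)
print(selected)
```

Because each feature is scored on its own, the procedure is fast, but it cannot detect features that are informative only in combination, which is the limitation of filter methods noted above.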
Among the five steps of KDD, the DM step is highlighted in epidemiological research. However, the other steps of KDD are also essential, and disregarding them may lead to inappropriate outcomes. Furthermore, if unsatisfactory results occur in any phase of the KDD process, it is possible to return to earlier stages and repeat them (Zhang et al., 2010). Therefore, applying a comprehensive and iterative KDD process is a factor in the success of epidemiological research, as it assists in making sound decisions and finding the best possible outcome in a given situation.
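To make the DM phase more tangible, the sketch below implements a basic K-means clustering of outbreak locations, the kind of unsupervised method mentioned above for spatiotemporal profiling. It is a minimal illustration under stated assumptions: the coordinates are invented, plain Euclidean distance on longitude and latitude is used instead of a geodesic distance, and a simple deterministic farthest-point initialization replaces the random initialization commonly used in practice.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic initialization: start at the first point, then repeatedly
    add the point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Alternate assignment and centroid-update steps until labels stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical outbreak locations as (longitude, latitude) pairs.
locations = [
    (100.1, 14.0), (100.3, 14.2), (100.2, 13.9),
    (105.8, 21.0), (106.0, 21.1), (105.9, 20.9),
]
centroids, labels = kmeans(locations, k=2)
print(labels)
```

The number of clusters `k` has to be chosen in advance, which in a KDD setting is itself a decision that may need to be revisited after the post-processing phase.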
Data-intensive modeling
AI modeling methods may be classified into two categories: data-intensive modeling and small-data modeling. The central goal of data-intensive modeling is mining new insights from vast and diverse datasets such as click-stream data, geo-location data, sensor network data, and digital health records (Marathe and Ramakrishnan, 2013).
Time-series analysis
Time-series data are a sequence of numerical data points in successive order showing how a given variable changes over time. The patterns captured by time-series models are useful for predicting future events. A time-series model commonly used in previous studies (Soebiyanto et al., 2010; Kane et al., 2014; Permanasari et al., 2015; Chadsuthi et al., 2015; Ngattia et al., 2016) is the auto-regressive integrated moving average (ARIMA) or Box–Jenkins model (Box et al., 2015). ARIMA combines an auto-regressive component, differencing of the series (the ‘integrated’ part), and a moving average component.
Time-series analyses in AI research have been applied to model the temporal changes of AI incidence and to forecast possible outbreaks. For example, a non-seasonal ARIMA model was built in a study by Permanasari et al. (2015) to forecast future occurrences of AI. The prediction was made based on a 10-year monthly time-series of AI incidence in two regions of Indonesia. The required parameters of the ARIMA model were selected using three tests: parameter significance, white noise, and residual normality. Similarly, ARIMA and RF time-series models have been used by Kane et al. (2014) to predict the future occurrence of AIV outbreaks.
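The mechanics of a non-seasonal ARIMA forecast can be sketched with the simplest non-trivial case, ARIMA(1,1,0): the series is differenced once, a first-order auto-regression is fitted to the differences, and forecasts are integrated back to the original scale. This is only an illustrative sketch. It uses ordinary least squares rather than the maximum likelihood estimation and diagnostic tests used in the cited studies, omits the moving average part, and the monthly case counts are invented.

```python
def difference(series):
    """First-order differencing: d[t] = y[t+1] - y[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1] + error."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev, mean_curr = sum(prev) / n, sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx if sxx else 0.0  # constant differences: no AR signal
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with an ARIMA(1,1,0)-style model."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff   # predict the next difference
        level += last_diff                # integrate back to the original scale
        forecasts.append(level)
    return forecasts


# Hypothetical monthly AI case counts.
monthly_cases = [12, 15, 14, 18, 21, 19, 24, 27, 26, 30]
print(forecast_arima_110(monthly_cases, steps=3))
```

In practice the orders (p, d, q) are chosen by inspecting autocorrelation structure and residual diagnostics, as in the parameter selection procedure described above, rather than fixed in advance.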
In terms of data sources used in time-series analysis of AI, studies are usually limited to the temporal changes of AI incidence. While the history of disease incidence is an important factor to consider in the prediction of future outbreaks, the role of other risk factors cannot be ignored. Accordingly, in other infectious disease studies, the association between time-series of disease incidence and climate factors has been examined (Chadsuthi et al., 2015; Ngattia et al., 2016). Influenza outbreaks have been predicted by incorporating climate factors such as rainfall and temperature as inputs for the ARIMAX model, which is an ARIMA model with additional explanatory variables. For example, Soebiyanto et al. (2010) showed that including climatic variables in ARIMA models leads to better performance compared to including only past case values. Additionally, Chadsuthi et al. (2015) showed that the best performance for central regions of Thailand was obtained using the ARIMAX model that included the average temperature and the minimum relative humidity, whereas, for southern regions, minimum relative humidity as the input series resulted in the best model. Similarly, Ngattia et al. (2016) concluded that adding rainfall as a factor increases the performance of the ARIMAX model.
Classical models, such as ARIMA and support vector machines, are present in the literature concerning infectious disease (Zhang et al., 2014; Chadsuthi et al., 2015; Imai et al., 2015; Song et al., 2016) and AI (Kane et al., 2014; Permanasari et al., 2015). However, computational intelligence models such as those introduced by Ma et al. (2015) are not widely used, despite the fact that they have the potential to outperform classical techniques. Classical models usually require pre-defined assumptions, such as normally distributed residuals. Also, the performance of classical models may be jeopardized by noisy or missing data. Models such as long short-term memory and recurrent neural networks can discover non-linear and high-dimensional relationships in data (Ma et al., 2015). Furthermore, the application of ensemble methods such as RF could be considered in AI time-series analysis. Ensemble methods combine multiple models to obtain a single output in order to achieve better performance than any individual model. Recently, the potential of ensemble methods has been investigated for decision-making in infectious disease surveillance (Ray and Reich, 2018). Also, some studies related to infectious disease have found that RF most often results in better prediction performance than ARIMA (Kane et al., 2014; Wu et al., 2017). These methods can be used in future AI research to improve the accuracy and reliability of predictions in comparison with a single model (Araque et al., 2017).
Social media surveillance
Traditionally, reports from hospitals or public health centers have been used for disease surveillance (Robertson and Yee, 2016). However, these passive case reports are usually created manually and are reported 1–2 weeks after the cases are diagnosed. This can delay subsequent actions in the case of disease outbreaks. With the advent of social media, blogging websites, and web searches, online media have been employed as surrogate data sources. To obtain data from online media, crawlers and application programming interfaces (APIs) have been utilized. Many websites offer APIs for their services, which allow third parties to query and fetch data in a convenient format. A crawler is an Internet bot used for browsing websites and social media automatically and regularly. Disease trends in social media have been employed for epidemiological purposes, as they enable authorities to track, predict, and be informed of disease emergencies.
Several studies have been carried out to examine the value of social media for human disease surveillance and to assess its potential as a surrogate source for conventional disease reports. For instance, the strength of the relationship between disease-related posts and reports from health institutions (e.g. the Centers for Disease Control and Prevention (CDC), the World Health Organization (WHO), and the World Organization for Animal Health (OIE)) has been measured using correlation.
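The correlation analysis mentioned above can be sketched as follows. Because official reports tend to lag behind the events they describe, the example computes the Pearson correlation between weekly post counts and official report counts at several lags and reports the lag with the strongest correlation. The weekly counts and the function names are hypothetical; the cited studies may have used different correlation measures or alignment procedures.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def lagged_correlations(posts, reports, max_lag):
    """Correlate posts[t] with reports[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = posts[: len(posts) - lag]
        y = reports[lag:]
        result[lag] = pearson(x, y)
    return result


# Hypothetical weekly counts: official reports mirror the posts two weeks later.
weekly_posts = [1, 3, 7, 4, 2, 1, 1, 5, 9, 3]
weekly_reports = [0, 0, 1, 3, 7, 4, 2, 1, 1, 5]

correlations = lagged_correlations(weekly_posts, weekly_reports, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

A high correlation at a positive lag would suggest that online posts carry an early signal relative to official reporting, although correlation alone does not establish that the posts refer to the same cases.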
To study social media for animal disease surveillance, there are several barriers and challenges to overcome. For instance, when analyzing social media posts regarding influenza, it is critical to differentiate between human and animal influenza. This is because users of social media are people who usually use it to communicate their daily events. Therefore, it is more likely that social media posts represent cases of human influenza. Consequently, several studies that exploit social media for the purpose of surveillance have focused on disease among human populations (Achrekar et al., 2011; Szomszor et al., 2011; Chen et al., 2016; Sharpe et al., 2017). Nevertheless, findings by Szomszor et al. (2011) indicate that social media users share articles from official resources. Therefore, articles regarding animal disease can be shared on social media. This provides researchers with an opportunity to employ social media for monitoring animal diseases, such as AI. AI surveillance using social media has been previously attempted. Robertson and Yee (2016) introduced an online AI surveillance system to detect AIV outbreaks. First, AI-related Twitter posts were collected, and outbreaks were identified based on anomalies in the time-series data. After comparing the outbreaks detected in Twitter with AIV outbreak reports from the OIE in the same period, a strong correlation was discovered. Anomalies were detected using a general linear time-series algorithm based on static and dynamic thresholds. Moreover, a latent Dirichlet allocation model was applied to the outbreak data to extract topics, concluding that the dynamic threshold leads to more meaningful topics. Further research in social media mining is required to determine whether social media can be employed as a reliable online surveillance mechanism for AI.
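The difference between static and dynamic thresholds for anomaly detection in a count time-series can be illustrated with a short sketch. It is not the algorithm of Robertson and Yee (2016); it is a generic example that assumes a fixed cut-off for the static case and a rolling mean plus `k` standard deviations over the preceding `window` days for the dynamic case. The daily post counts are invented.

```python
from statistics import mean, stdev


def static_anomalies(counts, threshold):
    """Flag every time point whose count exceeds one fixed threshold."""
    return [i for i, c in enumerate(counts) if c > threshold]


def dynamic_anomalies(counts, window=7, k=3.0):
    """Flag time points exceeding mean + k * stdev of the preceding window."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        limit = mean(history) + k * stdev(history)
        if counts[i] > limit:
            flagged.append(i)
    return flagged


# Hypothetical daily counts of AI-related posts.
daily_posts = [10, 12, 11, 9, 10, 13, 11, 40, 12, 10, 11, 60, 12]

print(static_anomalies(daily_posts, threshold=50))
print(dynamic_anomalies(daily_posts, window=7, k=3.0))
```

A static threshold is easy to interpret but misses smaller spikes when baseline activity is low, whereas a dynamic threshold adapts to the recent baseline at the cost of being temporarily inflated after a large spike.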
Another challenge in social media analysis is the volume of data. The growth of data has led to the development of new database technologies. Large amounts of data extracted from social media make analysis and daily maintenance of data difficult. Both relational (Byrd et al., 2016; Jayawardhana, 2016) and NoSQL (Padmanabhan et al., 2013; Wang et al., 2016) databases have been used in infectious disease surveillance using social media data. Traditional relational databases are designed for structured data of moderate volume, while NoSQL (not only structured query language) databases are suitable for non-structured data (e.g. articles, photos, social media data, or videos). In comparison with relational databases, NoSQL databases offer lower cost, easier scalability, and open source features, which make them a candidate option for AI surveillance applications that employ large volumes of social media data.
Social media data pre-processing can be challenging and time-consuming. Social media contains spam messages that need to be discarded and unstructured text that needs to be transformed into a form interpretable by DM algorithms. In general, spam removal has been performed in a limited number of studies that developed surveillance systems to monitor disease from social media (Szomszor et al., 2011; Kostkova et al., 2014; Signorini, 2014). In the AI surveillance study that employed Twitter (Robertson and Yee, 2016), several data cleaning operations such as stop word removal were performed, but spam removal was skipped. Removing spam, however, can enhance the accuracy of disease surveillance systems.
Furthermore, there are several gaps in the currently used methods for pre-processing of social media data for disease surveillance. For instance, manual spam removal methods, such as the link ratio calculation method used by Szomszor et al. (2011), are only applicable to a specific group of tweets (i.e. those with a link). In addition, hand-crafted features (e.g. the bag of words method) have usually been employed as the input of spam detection classification algorithms. The manual process of feature extraction, however, involves human labor and relies on expert knowledge. Therefore, state-of-the-art methods, such as deep learning algorithms, have the potential to be used for spam detection in texts. Deep learning algorithms are capable of generating word or sentence representations automatically as part of their learning process.
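To show what a hand-crafted bag of words representation looks like in practice, the sketch below lowercases each post, strips URLs, removes stop words, and counts the remaining tokens against a fixed vocabulary. The stop word list, the tokenization rule, and the two example posts are illustrative assumptions; a real spam classifier would be trained on vectors like these together with labeled examples.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase, remove URLs, split into tokens, and drop stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [t for t in re.findall(r"[a-z0-9#@']+", text) if t not in STOP_WORDS]


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector with one entry per vocabulary token."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:  # ignore tokens unseen when the vocabulary was built
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical posts: one on-topic, one spam-like.
posts = [
    "H5N1 outbreak reported in poultry farm",
    "Win a free phone, click http://example.com free",
]
vocabulary = build_vocabulary(posts)
vectors = [bag_of_words(p, vocabulary) for p in posts]
print(vectors)
```

The representation ignores word order and context, which is one reason learned word or sentence representations are attractive for this task.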
In general, among KDD steps related to social media surveillance research, the preprocessing step has received much attention. This is because social media text is unstructured, requiring its transformation to an interpretable form for DM algorithms. Moreover, among DM methods, correlation and classification are widely used in social media analyses.
Spatiotemporal risk prediction models
Spatiotemporal variabilities are key to reliable predictions of infectious disease (Arab, 2015). However, spatiotemporal predictions depend on the availability of relevant time- and space-related health data. Recent computational advances combined with accessibility to data containing time and space dimensions have made spatiotemporal models more popular (Arab, 2015). These models are used to analyze the spatiotemporal evolution of infectious diseases and assess the effect of control policies. In spatiotemporal models, clusters of disease are usually depicted on geographical maps to show the risk of disease occurrence (Gilbert and Pfeiffer, 2012).
In the AI modeling literature, considerable efforts have been put forth to find a connection between AI and environmental factors (Erraguntla et al., 2010; Herrick et al., 2013; Mu et al., 2014). Furthermore, some studies have connected AIV transmission with migratory birds and poultry trade. For instance, Kilpatrick et al. (2006) determined H5N1 HPAI pathways that led to introduction of the virus into 52 countries and predicted the most likely mechanisms, including migratory bird movements and poultry trade, that facilitated the spread of AIV. In addition, the impact of agriculture and ecology, such as the presence of ducks and rice harvests, on the risk of HPAI has been explored (Gilbert et al., 2007; Martin et al., 2011b).
In addition to the risk prediction studies, attempts have been made to simulate the spread of AIV by considering a more comprehensive list of risk factors. To this end, Patyk et al. (2013) conducted a transmission simulation of HPAI H5N1 infection in commercial and backyard domestic poultry in South Carolina. They divided risk factor parameters into direct, indirect, and airborne. Subsequently, the North American Animal Disease Spread Model (NAADSM) was used to simulate H5N1 transmission. NAADSM is a well-established stochastic spread simulation framework designed for populations of livestock herds. Ultimately, Patyk et al. (2013) concluded that parameters related to indirect contact, such as movement of people, vehicles, and fomites, had the highest impact on the number of infected flocks and the duration of outbreaks.
Risk-based studies usually generate hypotheses using previous observations or examples in the literature. Hypotheses are then defined and tested to investigate the impact of risk factors on AI outbreaks (Belkhiria et al., 2018). This approach may overlook relevant hypotheses that the investigators did not anticipate. As a solution, ML methods can be employed to extract rules from observational data. The process of extracting rules can be less time-consuming than generating and testing hypotheses. Approaches such as online analytical processing, association rule mining, and sequential pattern mining have been used to find hidden rules and assess the temporal and spatial transmission of HPAI (Xu et al., 2017). The information obtained from these analyses assists decision makers in understanding the spatial and temporal routes that AI will likely follow in the future.
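The idea of extracting rules directly from observational data can be illustrated with a minimal association rule sketch based on the standard support and confidence measures. The records, attribute names, and thresholds are hypothetical, and the brute-force enumeration of item pairs stands in for more efficient algorithms such as Apriori; it does not reproduce the analysis of Xu et al. (2017).

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def mine_pair_rules(transactions, min_support, min_confidence):
    """Find rules lhs -> rhs between single items.

    support(lhs -> rhs)    = P(lhs and rhs)
    confidence(lhs -> rhs) = P(lhs and rhs) / P(lhs)
    """
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # infrequent pairs cannot yield rules above the threshold
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical farm records, each a set of observed attributes.
records = [
    {"wetland_nearby", "free_range_ducks", "outbreak"},
    {"wetland_nearby", "free_range_ducks", "outbreak"},
    {"wetland_nearby", "outbreak"},
    {"free_range_ducks"},
    {"wetland_nearby"},
]

rules = mine_pair_rules(records, min_support=0.5, min_confidence=0.7)
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

Rules found this way describe co-occurrence, not causation, so they are best treated as candidate hypotheses to be examined by domain experts.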
Small-data modeling
The following section reviews research on statistical and mathematical modeling methods that rely on data collection approaches such as questionnaires, interviews, sampling, contact tracing, and direct observation. These studies can advance the understanding of AIV behavior and dynamics.
Empirical studies
Empirical studies may be divided into two main groups: (1) studies that estimate AI transmission parameters from experimental and observational data; and (2) studies that exploit observed contact networks to build network models of AI transmission. Mathematical transmission models act as a framework to facilitate the understanding of the complex processes of disease contagion (Wiratsudakul, 2014). Once epidemiological, traffic, and biological data are imported into mathematical models, transmission patterns and parameters can be quantified (Wilasang et al., 2016). In the AI literature, transmission models are based upon a series of assumptions, including equal infection susceptibility of birds, lack of pre-existing immunity in a flock, and similar levels of infection among infected birds.
Estimation of transmission dynamics
A great number of within-flock studies have attempted to estimate transmission parameters at the flock level (Van der Goot et al., 2003; Tiensin et al., 2007; Bos et al., 2009; Rohani et al., 2009; Bos et al., 2010; Comin et al., 2011; Gonzales et al., 2011; Saenz et al., 2012; Wang et al., 2012; Nickbakhsh et al., 2016). The main goal of these studies has been to extract parameter values for future control and surveillance programs (Stegeman et al., 2004; Savill et al., 2006; Tiensin et al., 2007; Bouma et al., 2009).
Some of the above studies estimated AIV transmission dynamics using generalized linear model (GLM) and ‘Final Size’ statistical methods (Van der Goot et al., 2003; Comin et al., 2011). GLM estimation is thought to be more precise and more widely used than the ‘Final Size’ method (Gonzales et al., 2011). In a study by Gonzales et al. (2011), the transmission parameters for five chickens infected with a low pathogenic H7N1 virus were explored using five contact chickens. By assuming the latent period of infected birds to be a maximum of 1 day, the mean infectious period, the transmission rate (β), and the basic reproduction ratio (R0) were estimated. Estimates like these are beneficial for building surveillance and control programs in poultry.
Small empirical studies usually work with a small number of birds (Van der Goot et al., 2003; Gonzales et al., 2011). The results, therefore, cannot be directly extrapolated to real-world situations. In other words, the estimated variables in empirical studies have low resolution and may not be precise enough to construct models (Pepin et al., 2014). To address this issue, Saenz et al. (2012) estimated the dynamics of LPAI and HPAI spread using a greater number of contact turkeys compared to studies conducted by Gonzales et al. (2011) and Van der Goot et al. (2003). In the aforementioned studies, the daily number of dead birds was fitted to a stochastic Susceptible-Infectious-Recovered (SIR) model.
Back-calculation has been used by Tiensin et al. (2007) and Bos et al. (2010) to estimate the required AIV transmission parameters. In the back-calculation method, mortality is measured regularly, and the earlier time series of the other classes in the SIR model are then calculated from the mortality time series. These calculations are based on several assumptions, such as a predetermined infectious period and number of days to death after infection. This method is not applicable to LPAI, as the mortality rate for LPAI is very low. However, in order to measure HPAI H5N1 transmission dynamics within a flock, Tiensin et al. (2007) applied the statistical back-calculation method to 139 flocks of poultry in Thailand. Given the time series of recovered (R) birds (i.e. mortality) and the infectious period in an SIR model, the time series of susceptible (S) and infectious (I) birds were calculated. The resulting infection time series was then fitted with a GLM using a negative binomial likelihood to find the transmission parameters. Depending on the length of the infectious period, R0 was estimated to be between 2.26 and 2.64. These results can help to evaluate policies with simulation studies.
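The back-calculation idea described above can be sketched in a few lines. This is a minimal illustration, not the procedure of Tiensin et al.: it assumes that every infected bird is infectious for a fixed number of days and then dies, it uses made-up mortality counts, and it replaces the negative binomial GLM with a crude ratio estimator for β. The function names are hypothetical.

```python
def back_calculate(daily_deaths, flock_size, infectious_days):
    """Rebuild S, I and new-infection series from daily mortality.

    Assumption (hypothetical simplification): every infected bird is
    infectious for exactly `infectious_days` days and then dies.
    """
    n_days = len(daily_deaths)
    new_cases = [0] * n_days
    for day, deaths in enumerate(daily_deaths):
        # Birds dying on `day` became infectious `infectious_days` earlier;
        # onsets before the first observation are assigned to day 0.
        onset = max(day - infectious_days, 0)
        new_cases[onset] += deaths

    susceptible, infectious = [], []
    cumulative_cases = cumulative_deaths = 0
    for day in range(n_days):
        cumulative_cases += new_cases[day]
        cumulative_deaths += daily_deaths[day]
        susceptible.append(flock_size - cumulative_cases)
        infectious.append(cumulative_cases - cumulative_deaths)
    return susceptible, infectious, new_cases


def estimate_beta(susceptible, infectious, new_cases, flock_size):
    """Crude ratio estimator: cases on day t+1 ~ beta * S[t] * I[t] / N."""
    exposure = sum(s * i / flock_size
                   for s, i in zip(susceptible[:-1], infectious[:-1]))
    return sum(new_cases[1:]) / exposure


# Made-up mortality counts for a flock of 100 birds.
deaths = [0, 0, 1, 2, 4, 3, 0]
S, I, cases = back_calculate(deaths, flock_size=100, infectious_days=2)
beta = estimate_beta(S, I, cases, flock_size=100)
print(S, I, round(beta * 2, 2))  # last value: R0 = beta * infectious period
```

Because the reconstructed series depend directly on the assumed infectious period, rerunning the sketch with a different `infectious_days` value shows why the R0 estimate is reported as a range.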
There are significant differences between the values estimated by Gonzales et al. (2011) and those reported in other similar studies (Van der Goot et al., 2003). This difference can be attributed to the origin of the isolated viruses used in these experiments. For instance, the virus used in the study by Gonzales et al. (2011) originated in turkeys, which are more susceptible to LPAI. Additionally, the inconsistency of output values might have been due to the use of different AIV strains that have various transmission characteristics.
It is worth mentioning that the variety of AIV strains, farm characteristics, and the age of birds can lead to inconsistencies in estimated transmission parameters. In order to assess the effect of this variance on outcome parameters, Comin et al. (2011) took into account a range of values for transmission dynamics. It was concluded that the variation of R0 plays an essential role in the outcome of an epidemic. Furthermore, the inconsistency of dynamics found in past experiments has been considered in simulation studies of LPAI in chickens (Gonzales et al., 2014). In the latter study, a categorization of LPAI dynamics into low and high categories was introduced based on the variability in R0.
Network models
The contact patterns among farms form a social network. Social network analysis (SNA) has been applied to both animal movement and trade networks of poultry (Van Kerkhove et al., 2009; Martin et al., 2011a; Hosseini et al., 2013; Lee et al., 2014; Moyen et al., 2018) and of other animals (Nöremark et al., 2011; Lebl et al., 2016). Networks can be represented by graphs, adjacency matrices, or sets of pairs. SNA utilizes the concepts of graph theory, which allows users to identify the essential components of a graph and find its key patterns. In fact, searching for dominant spreaders in networks is crucial in controlling epidemics. In other words, once the movement or spatial structure in an area is explained, it may reveal how an infection could spread throughout that area. Such insights can then assist in planning containment policies.
Several metrics, such as centrality measures (Lee et al., 2014; Moyen et al., 2018), are calculated to highlight AI introduction or spread risks in a defined network among chickens or flocks. In SNA models used in AI, a node usually represents a flock, market, or trader, while an edge represents a connection, usually movement, between those nodes. When every node in a graph is directly or indirectly reachable from every other node in that graph, the graph is called a strong component. If a node that is part of a strong component becomes infected, that node is likely to infect all other nodes in the component. Furthermore, if removing an edge or node divides a graph into two separate parts, removing it can curtail the spread of disease. Such an edge is known as a bridge, and such a node as a cut-point (Martínez-López et al., 2009).
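The network concepts above (degree as a simple centrality measure, strong components, and cut-points) can be illustrated with a small pure-Python sketch. The movement network below is invented for illustration, and the brute-force cut-point search is only meant for small graphs; a graph library would normally be used in practice.

```python
from collections import deque


def adjacency(edges, directed=True):
    """Build an adjacency dictionary from (source, target) pairs."""
    adj = {node: set() for edge in edges for node in edge}
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)
    return adj


def reachable(adj, start, removed=frozenset()):
    """Breadth-first search that ignores any node listed in `removed`."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbour in adj[queue.popleft()]:
            if neighbour not in seen and neighbour not in removed:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def degree(edges):
    """Total degree (incoming plus outgoing movements) per node."""
    counts = {node: 0 for edge in edges for node in edge}
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def strong_component(edges, node):
    """Nodes that can both reach `node` and be reached from it."""
    forward = reachable(adjacency(edges), node)
    backward = reachable(adjacency([(v, u) for u, v in edges]), node)
    return forward & backward


def cut_points(edges):
    """Nodes whose removal disconnects the (undirected, connected) graph."""
    adj = adjacency(edges, directed=False)
    found = set()
    for node in adj:
        rest = set(adj) - {node}
        if rest and reachable(adj, next(iter(rest)), removed={node}) != rest:
            found.add(node)
    return found


# Hypothetical movements: farms A, B, C trade in a cycle; C sells to
# market M, which supplies farm D.
movements = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "M"), ("M", "D")]
print(degree(movements))
print(strong_component(movements, "A"))
print(cut_points(movements))
```

In this toy network, farm C and market M are the cut-points: removing either one separates farm D from the trading cycle, which is the kind of node a targeted control measure would prioritize.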
The effect of network properties on the persistence of H5N1 virus has been evaluated within a poultry population (Hosseini et al., 2013). A stochastic simulation was constructed using the Gillespie algorithm considering a network of flocks, traders, and markets. The findings showed that the size of flocks and frequency of interactions among flocks play a role in the persistence of H5N1 infection and the pace at which an epidemic occurs.
There are several gaps in current network models of poultry movements. Although the effect of network measures, flock size, and movement frequency has been assessed in poultry, the topologies of contact networks formed by the movement of traders have been overlooked. There are four known types of theoretical contact network: random, small-world, lattice, and scale-free networks. Moreover, the social network analyses performed in the AI literature have not taken into account the temporal ordering of trade links. Recently, temporal network analysis has been used in pig trade networks (Lebl et al., 2016), where each connection has a time stamp denoting when it occurred. Temporal network analysis could therefore be used to better assess the impact of control measures in poultry.
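To show why the temporal ordering of trade links matters, the following sketch compares reachability in a static network with reachability along time-respecting paths. The contacts and day numbers are invented, and the rule that a farm can only pass on infection through movements occurring strictly after its own infection is an assumption of this sketch.

```python
def static_reach(contacts, seed):
    """Nodes reachable from `seed` when time stamps are ignored."""
    reached = {seed}
    changed = True
    while changed:
        changed = False
        for source, target, _ in contacts:
            if source in reached and target not in reached:
                reached.add(target)
                changed = True
    return reached


def temporal_reach(contacts, seed, start_time=0):
    """Earliest infection time per node along time-respecting paths.

    Assumption: a node transmits only through contacts that occur
    strictly after the time at which it became infected.
    """
    arrival = {seed: start_time}
    for source, target, time in sorted(contacts, key=lambda c: c[2]):
        if source in arrival and arrival[source] < time and target not in arrival:
            arrival[target] = time
    return arrival


# Hypothetical time-stamped movements: (from, to, day).
contacts = [("A", "B", 3), ("B", "C", 1), ("B", "D", 5)]
print(static_reach(contacts, "A"))
print(temporal_reach(contacts, "A"))
```

The static view suggests that an infection starting at farm A can reach C, whereas the temporal view shows that the only movement from B to C took place before B could have been infected.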
Simulation studies
Under experimental conditions, it is impossible to explore the variability in transmission processes that occurs in real situations. Therefore, simulation models help to extrapolate results to field situations. For example, Reeves (2012) developed a stochastic simulation model to incorporate within-flock transmission dynamics by estimating latent, subclinical infectious, and clinical infectious stages. In this study, a within-flock simulation of HPAI was performed for broiler chickens in three scenarios using hourly time steps, where the presence of the virus was detected based on a rise in bird mortality. It was concluded that not only could HPAI virus still spread in vaccinated flocks, but its detection could also be delayed (silent spread). Finally, it was suggested that vaccination could be useful to reduce the degree of spread of HPAI between flocks.
Simulation studies have also been performed in which bird vaccination is included. Simulation is sometimes used to evaluate several vaccination strategies based on parameters obtained from previous studies. For instance, Galvin et al. (2014) performed a simulation study of vaccination strategies and compared them with non-vaccination practices, with the main objective of finding a cost-effective strategy. In order to simulate the impact of vaccination, a Susceptible-Exposed-Infectious-Recovered-Dead (SEIRD) compartmental model, in which ‘D’ denoted an additional ‘dead’ health state, was applied. In this model, chickens vaccinated with an inactivated virus vaccine transitioned directly from the ‘S’ to the ‘R’ state. By taking into account the costs associated with vaccination and losses due to mortality, it was concluded that immunization of 50% of the birds within a flock with the inactivated virus vaccine is the most cost-effective strategy. In another study, a simulation was conducted to estimate the transmission parameters of AIV in an unvaccinated group, a vaccinated group, and from an unvaccinated group to a vaccinated group (Poetri et al., 2009). Birds were observed regularly after vaccination, and the observed data were then fitted to Susceptible-Exposed-Infectious-Recovered (SEIR) simulation data by maximizing the likelihood of the parameters. Finally, it was concluded that an H5N2 inactivated virus vaccine could reduce the susceptibility of chickens to HPAI H5N1 by 88%.
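The structure of such a vaccination comparison can be sketched with a simple deterministic SEIRD model in which vaccinated birds start in the ‘R’ state. This is not the model of Galvin et al.: the transmission parameters, mortality fraction, vaccine cost, and bird value below are all hypothetical, so the sketch shows only how coverage levels can be compared and does not reproduce the 50% result.

```python
def seird_deaths(coverage, n=10000, beta=1.0, sigma=0.5, gamma=0.25,
                 mortality=0.9, days=200, dt=0.1):
    """Deaths in a deterministic SEIRD outbreak (Euler integration).

    Vaccinated birds are moved directly from S to R before the outbreak.
    All default parameter values are hypothetical.
    """
    s = (n - 1) * (1.0 - coverage)
    r = (n - 1) * coverage
    e, i, d = 0.0, 1.0, 0.0  # one infectious bird seeds the outbreak
    for _ in range(round(days / dt)):
        alive = s + e + i + r
        new_exposed = beta * s * i / alive * dt
        new_infectious = sigma * e * dt
        removed = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - removed
        r += removed * (1.0 - mortality)
        d += removed * mortality
    return d


def total_cost(coverage, n=10000, vaccine_cost=0.1, bird_value=2.0):
    """Vaccination cost plus mortality losses (hypothetical unit costs)."""
    return vaccine_cost * n * coverage + bird_value * seird_deaths(coverage, n)


for coverage in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(coverage, round(seird_deaths(coverage)), round(total_cost(coverage)))
```

Which coverage level minimizes the total cost depends entirely on the assumed unit costs and transmission parameters, which is why published conclusions cannot be transferred to other settings without re-estimating them.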
Behavioral-based models
This section explores compartmental and agent-based models (ABM). Compartmental models usually focus on the average behavior of a group, while ABMs build detailed individual behaviors. In addition, compartmental models follow a top-down approach whereas ABMs follow a bottom-up approach. Top-down models utilize estimated parameters to simulate a process. Conversely, bottom-up models use simulated data to estimate parameters. Behavioral models can be deterministic or stochastic. Stochastic models consider random elements and run thousands of scenarios using simulation algorithms such as the Gillespie algorithm. While the output of a deterministic model is the same each time (Maidstone, 2012), the output from different runs of a stochastic model varies and can be summarized in various ways.
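The difference between deterministic and stochastic runs can be made concrete with a minimal Gillespie simulation of an SIR process. The parameter values and flock size below are hypothetical, and the two-event formulation (infection and removal only) is a deliberate simplification.

```python
import random


def gillespie_sir(beta, gamma, s, i, r, t_max, rng):
    """Simulate one stochastic SIR trajectory with the Gillespie algorithm."""
    n = s + i + r
    t = 0.0
    history = [(t, s, i, r)]
    while i > 0:
        infection_rate = beta * s * i / n
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        # Waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # Choose which event occurs, in proportion to its rate.
        if rng.random() * total_rate < infection_rate:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        history.append((t, s, i, r))
    return history


# Five runs with identical (hypothetical) parameters but different seeds.
final_sizes = []
for seed in range(5):
    run = gillespie_sir(0.5, 0.2, 99, 1, 0, 365.0, random.Random(seed))
    final_sizes.append(run[-1][3])
print(final_sizes)
```

Each seed produces its own trajectory, so the final outbreak sizes form a distribution that would typically be summarized by a mean, median, or percentile range over many runs.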
Compartmental models: Compartmental models are simple population-based models, which are extensively used in AI research. These models are known as SIR or system dynamics models (Thakur et al., 2015). Several SIR extensions such as Susceptible-Exposed-Infectious-Recovered, Susceptible-Infectious, Susceptible-Infectious-Susceptible, and Susceptible-Exposed-Infectious-Susceptible have also been introduced. In compartmental models, at each discrete time unit, a group of individuals may belong to one of the defined discrete classes based on the average health status of the group (Höhle and Jørgensen, 2002; Dorjee et al., 2013). Simulation of disease spread using compartmental models is typically performed using differential equations. In a compartmental model, the risk of spread of an infection can be described by its basic reproduction ratio (R0). R0 denotes the number of secondary cases that one infectious case generates during its infectious period in a fully susceptible population. For R0 values greater than one, an outbreak can take place and reach a peak, while for R0 values of less than one, there is no chance of a major outbreak (Coburn et al., 2009).
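The threshold role of R0 can be checked with a small deterministic SIR model integrated with the Euler method. The sketch assumes the standard relation R0 = β/γ for this model, and the parameter values are hypothetical.

```python
def simulate_sir(beta, gamma, s0, i0, days, dt=0.1):
    """Integrate the SIR differential equations with the Euler method.

    Returns the final S, I, R values and the peak number of infectious.
    """
    n = s0 + i0
    s, i, r = float(s0), float(i0), 0.0
    peak = i
    for _ in range(round(days / dt)):
        new_infections = beta * s * i / n * dt
        new_removals = gamma * i * dt
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        peak = max(peak, i)
    return s, i, r, peak


gamma = 0.2  # hypothetical removal rate (mean infectious period of 5 days)
for beta in (0.1, 0.5):  # gives R0 = 0.5 and R0 = 2.5
    s, i, r, peak = simulate_sir(beta, gamma, s0=999, i0=1, days=200)
    print(f"R0 = {beta / gamma:.1f}: peak = {peak:.1f}, removed = {r:.1f}")
```

With R0 below one the number of infectious birds only declines from its initial value, whereas with R0 above one it rises to a peak before the outbreak burns out.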
Agent-based models: A more recent and sophisticated group of models are known as stochastic individual-based, individual-centric, or agent-based models. In these models, the behavior, history, and properties (e.g. mobility) of every individual are taken into account. In addition, the population is heterogeneous, and the spatial structure of the population can be incorporated into the model. In stochastic models, the uncertainty and randomness of the real world are represented by probabilities. Therefore, stochastic ABM simulations produce a range of possible outcomes and contribute to the development of decision support tools (Taylor, 2003). ABMs have the potential to generate large amounts of data, and the processing of such data may be challenging. As a result, ABMs are expected to run more slowly (e.g. taking weeks) than compartmental models on a computer (Maidstone, 2012). Similar to agent-based modeling studies of infectious disease in pigs, ABM studies of the spread of AI (Patyk et al., 2013; Lewis et al., 2017) have been performed using the NAADSM conceptual framework.
Component-based simulation
This section divides the simulation models into within-flock and between-flock models based on the resolution that the models can account for. Between-flock transmission models, which consider a flock as the unit of interest, have a lower resolution than within-flock transmission models.
Within-flock transmission models: Within-flock transmission of AIV refers to transmission of the virus among birds within a single flock. Within-flock transmission simulations have been performed in poultry flocks (Reeves, 2012; Weaver et al., 2012) using stochastic state transition conceptual models. The transmission equation calculates the number of birds transitioning between states of disease in a time period. The transmission model is then used in conjunction with a simulation model to allow for a scenario-based understanding of disease spread in a flock. For example, within-flock simulation models are used to assess the impact of vaccination or virus strain on transmission.
There are several gaps that can be addressed in within-flock transmission simulation studies. The output of simulation models may not be representative of real field data because conditions can differ from one flock to another due to differences in flock characteristics such as housing systems and flock management, in addition to differences in virus strains. To address this, field data can be collected by building wireless sensor networks to track virus transmission behaviors for each flock. Furthermore, a number of factors that influence transmission dynamics, such as temperature, wind direction, ventilation system, and humidity, have been overlooked in within-flock transmission studies. Sound results generated by within-flock transmission models can then be used to develop the parameters of between-flock transmission models.
Between-flock transmission models: Between-flock transmission of AIV refers to direct (e.g. bird movement and bird trade) or indirect (e.g. human contact, shared trucks, and dust) transmission of AIV among poultry flocks. According to Pepin et al. (2014), performing between-flock experimental studies is impossible, as they are expensive and life-threatening.
Therefore, transmission models have been used in AI modeling to generate hypotheses on the impact of control measures and find optimal prevention solutions (Mannelli et al., 2007; Mulatti et al., 2010; Lee et al., 2014; Backer et al., 2015). Between-flock transmission models in AI have used probability-based (Dorigatti et al., 2010; Ssematimba et al., 2012; Backer et al., 2015), agent-based (Patyk et al., 2013; Lewis et al., 2017), and network-based (Van Kerkhove et al., 2009; Lee et al., 2014) approaches. Probability-based methods follow a top-down approach to estimate a kernel function that usually combines disease dynamics and distance between farms. This is due to the lack of detailed information on the level of contribution of each factor to an outbreak. On the other hand, agent-based and network-based approaches usually follow a bottom-up procedure to assess the effectiveness of control strategies. Network-based models are suitable when the network characteristics of a set of flocks and their biosecurity indicators need to be considered as risk factors for AI outbreak prediction (Martin et al., 2011b). Notably, the above approaches are not mutually exclusive, meaning that a model can be built on more than one approach.
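A probability-based approach of the kind described above can be sketched with a distance kernel. The functional form and the values of `k0`, `d0`, and `alpha` below are assumptions chosen for illustration; in a real study they would be estimated from outbreak data.

```python
import math


def kernel(distance_km, k0=0.002, d0=2.0, alpha=2.0):
    """Daily transmission hazard between two farms at a given distance.

    Assumed form: k0 / (1 + (d / d0) ** alpha), with hypothetical values.
    """
    return k0 / (1.0 + (distance_km / d0) ** alpha)


def daily_infection_probability(farm, infectious_farms):
    """Probability that a susceptible farm is infected within one day."""
    force = sum(kernel(math.dist(farm, other)) for other in infectious_farms)
    return 1.0 - math.exp(-force)


# Hypothetical farm coordinates in kilometres.
infectious = [(0.0, 0.0), (3.0, 4.0)]
near_farm = (1.0, 0.0)
far_farm = (30.0, 40.0)
print(daily_infection_probability(near_farm, infectious))
print(daily_infection_probability(far_farm, infectious))
```

The kernel lumps all transmission routes into a single distance-dependent hazard, which reflects the point made above: it is used when the contribution of each individual route is not known.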
In a study by Mulatti et al. (2010), a top-down approach was followed to find the best intervention policies for reducing virus transmission between flocks. Data from four previous LPAI epidemics with different interventions were fitted to a Susceptible-Infectious-Depopulated model. Subsequently, using univariate and multivariate analysis, the risk ratio and the reproduction number (R0) were estimated to identify the most effective policies.
In the Netherlands, Ssematimba et al. (2012) studied the role of downwind dust in the spread of HPAI H7N7 between poultry flocks. Particle deposition and virus decay were included when assembling the dispersion model. It was concluded that wind-borne pathogen transmission alone is not enough to explain the incidence of AI. However, for nearby surroundings, the wind-borne route plays a substantial role.
In this review, simulation studies are placed in the category of small-data modeling. However, it is worth noting that these studies could be considered data-intensive modeling methods when the parameter space is large. In this case, as a wide range of values is assigned to parameters, the timeliness of processes needs to be taken into account as well. Optimization algorithms, such as GA, can provide an effective search of the parameter space. In addition to a large parameter space, simulation approaches generally result in a large volume of output data. Extracting meaningful patterns from such data can be a computationally expensive task. Therefore, ML algorithms and big data stream processing techniques need to be considered in future transmission simulation models pertaining to AI.
Additional recommendations
There are several limitations regarding the data sources that have been exploited in recent studies focused on AI modeling. To begin, specific locations have received more attention than others due to the availability of data or a high number of confirmed cases in a specific area. For instance, a field survey in Phitsanulok province in Thailand has been used by several authors for AI modeling (Wiratsudakul et al., 2014; Wilasang et al., 2016). Another example is a dataset from an outbreak of H7N7 in the Netherlands in 2003, which has been used several times in the literature (Stegeman et al., 2004; Boender et al., 2007; Bavinck et al., 2009; Bos et al., 2009). Such data lead to findings that may not be generalizable to other locations and different virus strains. The above-mentioned retrospective studies infer insights about previous outbreaks in specific regions. However, poultry health authorities need to gain global knowledge about the underlying mechanisms of AI outbreaks. As a result, generalizing the insights gained from studies that focus on one specific time and region to other times and regions is still challenging. In addition, Twitter has been the main focus of digital surveillance studies. However, to the best of our knowledge, the potential of blogs, search engines, and news feeds has been overlooked in AI surveillance studies. Furthermore, AI risk-based studies define and test hypotheses to investigate the impact of risk factors on AI outbreaks. The hypotheses are usually generated based on past observations or the literature. FS, a pre-processing technique in KDD, can generate other hypotheses that may represent the behavior of AI more precisely.
The data cleaning step of KDD seems to be more commonly practiced in AI modeling studies compared to other pre-processing techniques including data integration, data transformation, and data reduction. Syndromic surveillance studies in social media, for instance, use natural language processing methods such as tokenization, stemming, lemmatization, and stop word removal to clean text data (Lee et al., 2013; Chen et al., 2016; Ghosh et al., 2017).
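A minimal sketch of such text cleaning is shown below. The stop word list and the suffix-stripping rule are deliberately tiny stand-ins for the stop word lists and stemmers provided by natural language processing libraries, and lemmatization is not covered; the example post is invented.

```python
import re

# Small illustrative stop word list (real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "at", "in", "is", "of", "the", "to"}


def simple_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def clean(text):
    """Lowercase, tokenize, remove stop words, and stem a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [simple_stem(token) for token in tokens if token not in STOP_WORDS]


print(clean("The chickens are dying at the farms"))
# ['chicken', 'dying', 'farm']
```

After cleaning, posts that use different word forms map to the same tokens, which makes keyword counts and classifiers for outbreak-related messages more reliable.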
Pre-processing of data is considered a more time-consuming phase of KDD compared to other phases (Tsumoto, 2000; García et al., 2016). It is estimated that pre-processing takes about 80% of the entire time allocated to a project (Duhamel et al., 2003; Pérez et al., 2015). As a result, to save time, performing this step simultaneously with data collection is recommended.
An important consideration is that decisions during AI emergencies need to be timely and rapid. Simultaneously, there is a rise in new and large digital data sources in epidemiology (Salathe et al., 2012). Therefore, parallel and distributed KDD methods may be used to enhance the performance of knowledge extraction from large datasets. In the current era, with advancements in computing power, traditional DM algorithms need to be adjusted in order to fit cutting-edge computing approaches, such as those being used in Hadoop (White, 2012).
Conclusions
The work presented here provides an overview of the modeling methods that have been proposed for control of AI. Furthermore, the present survey has highlighted AI research limitations with regard to the KDD process. As new technologies improve, AI modeling is turning into a data-intensive and multidisciplinary field, with high volume, variety, and velocity of data. Therefore, small data methods introduced here, in particular, need to be adapted to state-of-the-art analytical approaches to reveal new patterns that have previously been overlooked. This might consequently minimize the financial, animal health, and public health impacts of AI.
Acknowledgments
This work was funded by Egg Farmers of Canada, Chicken Farmers of Saskatchewan, and the Canadian Poultry Research Council. This research is supported in part by the University of Guelph's Food from Thought initiative, and the authors acknowledge funding from the Canada First Research Excellence Fund.