Categorizing biological information based on function–morphology for bioinspired conceptual design

Sooyeon Lee; Daniel A. McAdams; Elissa Morris

doi:10.1017/S0890060416000500


Published online by Cambridge University Press: 05 December 2016


Sooyeon Lee: Affiliation:
Department of Mechanical Engineering, Texas A&M University, College Station, Texas, USA
Daniel A. McAdams*: Affiliation:
Department of Mechanical Engineering, Texas A&M University, College Station, Texas, USA
Elissa Morris: Affiliation:
Department of Mechanical Engineering, Texas A&M University, College Station, Texas, USA
*: Reprint requests to: Daniel A. McAdams, Texas A&M University, Mechanical Engineering Office Building, 3123 TAMU, College Station, TX 77843, USA. E-mail: dmcadams@tamu.edu


Abstract

A function-based keyword search is a concept generation methodology studied in the bioinspired design area that conveys textual biological inspiration for engineering design. Current keyword search methods are inefficient primarily due to the knowledge gap between engineering and biology domains. To improve current keyword search methods, we propose an algorithm that extracts and organizes morphology-based solutions from biological text. WordNet is utilized to discover morphological solutions in biological text. The novel algorithm also adapts latent semantic analysis and the expectation–maximization algorithm to categorize morphological solutions and group biological text. We introduce a novel penalty function that reflects the distance between functions (problems) and morphologies (solutions). The penalty function allows the algorithm to extract morphological solutions directly related to a design problem. We compare the output of the algorithm to manually extracted solutions for validation. A case study is included to exemplify the utility of the developed algorithm. Upon implementation of the algorithm, engineering designers can discover innovative solutions in biological text in a straightforward, efficient manner.

Keywords

Bioinspired Design; Conceptual Design; Keyword Search; Morphology; Text Mining

Type: Regular Articles
Information: AI EDAM, Volume 31, Special Issue 3: Uncertainty Quantification for Engineering Design, August 2017, pp. 359–375

DOI: https://doi.org/10.1017/S0890060416000500
Copyright: Copyright © Cambridge University Press 2016

1. INTRODUCTION

Bioinspired design uses nature as a source of inspiration for solving problems in the engineering field. Though there is evidence that biological systems can provide a wealth of elegant and ingenious approaches to problem solving (Hacco & Shu, 2002; Shu et al., 2011; Vattam et al., 2011), challenges exist that prevent designers from taking full advantage of the biological knowledge domain. A key challenge is the knowledge gap between the disciplines of biology and engineering. The lack of biological expertise on the part of the engineer makes it difficult to find solutions in nature that might be adapted to an engineering problem. Even when a solution is identified, the engineer may experience difficulty understanding the description of the biological system, thereby suppressing effective design inspiration. While an obvious assumption may be that biologists are the most efficient in bioinspired design, researchers have demonstrated that an educational background in biology does not improve one's ability to use nature as inspiration for engineering design (Tsenn et al., 2015).

Several research efforts have been made to bridge the knowledge gap between the engineering and biology domains (Hirtz et al., 2002; Chakrabarti et al., 2005; Nagel et al., 2010; Cheong et al., 2011; Shu et al., 2011). These studies help establish a relationship between engineering functional keywords and biologically meaningful keywords, laying the groundwork for a textual connection between the two domains. Based on these efforts, an engineer can translate his or her desired engineering function into an equivalent biological function and then search for a biological system using the translated function. This function-based analogy enables biological systems to be mined for strategies, principles, and morphologies that can be used to solve engineering problems. In this paper, we refer to this bioinspired design approach as a biological keyword search, or more simply, a keyword search.

A keyword search finds solutions in nature contained in biological text, such as academic journals or biological textbooks. The text passages containing the keyword are returned to the designer. Ideally, the returned passages contain a description of the system that performs the keyword function. The designer may then review the returned passages for design inspiration. However, the main drawback of keyword search methods is the large volume of passages returned to the user. Many of the returned passages do not contain specific biological solutions in the text. Therefore, designers must make a considerable investment of time reading all of the passages to find a biological solution. Moreover, designers must repeatedly read information that is not useful to their engineering problem. This arduous process prevents effective use of keyword search methods. Research efforts have been made to improve keyword search methods for bioinspired design by focusing on areas such as scalability (Vandevenne et al., 2016), text categorization (Ke et al., 2009), and improved search term formulation (Kaiser et al., 2014). However, we take a unique approach to addressing the drawbacks of function-based biological keyword search methods.

The core research objective is to develop an algorithm to identify solution morphologies in the returned passages and cluster the found biological solutions based on these morphologies. Thus, only passages containing morphological terminology are returned to the user. In addition, the returned passages are presented to the designer in a focused and organized manner to further improve the user's search for solutions and design inspiration. WordNet noun categorization is utilized to extract information related to morphology in biological text. Latent semantic analysis (LSA), which is widely used for text clustering in information retrieval, is also utilized to cluster and classify the returned passages based on the morphological solution.

The computational theories used are described in the following section. The novel algorithm is then presented followed by an evaluation section comparing the output of the algorithm with manually produced results. The final section concludes this study and outlines immediate future work.

2. BACKGROUND AND RESEARCH PROCEDURES

A discussion of morphology is included in this section. The discussion of morphology provides context for usage of the term as well as motivation for how finding morphological solutions relevant to design problems can benefit conceptual design. Representational schemes from the design community and related computational approaches upon which this work builds are also presented in detail below.

2.1. Morphology

The Oxford English Dictionary provides a general definition of the word morphology as “a particular form, shape, or structure.” However, the specific meaning of the term morphology changes depending on the field of study. In linguistics, morphology refers to the study of the structure of linguistic units, such as parts of speech and content. In biology, morphology refers to the study of the structure of living organisms. This research effort uses the definition of morphology as applied to biology. Our focus on morphological elements of a physical process, entity, or solution in this research is due to the importance of morphology in function-based conceptual design as well as its frequent usage in biological texts. Identifying the morphology used in the execution of some biological function leads to increased understanding of how biological systems may be adapted to engineered systems.

We propose that identifying biological morphology as related to function can also enlarge the engineer's conceptual design space. An enlarged structural design space has been shown to contribute to creative design (Gero & Kazakov, 1999). When a designer learns a relation of function and structure, researchers have shown that it is difficult to discard that studied knowledge (Qian & Gero, 1992). Thus, the conceptual design solution space can be enlarged if a designer is introduced to a completely new structure and shape, thereby contributing to creative design. Importing biological morphology into the engineering design solution space is expected to help engineers discover and create unique morphologies not found in conventional engineering designs.

This research effort aims to mine morphological solutions in biological text returned by functional keyword searches. We propose that mining the returned biological passages for morphology will enable designers to import innovative morphological solutions to an engineering problem. The detailed research approach to discover morphological expressions in the text is discussed in the algorithm section of this paper.

2.2. LSA

As previously discussed, keyword search methods tend to be inefficient due to the large quantity of passages returned to the user. To address this problem, we propose an algorithm that uses LSA to return clustered and organized results to the designer. LSA is an information retrieval technique that is often used in automatic text (or document) clustering and topic identification. LSA can be used to estimate the relatedness of a word to a larger text unit, or as a computational model of how knowledge is acquired and used (Landauer et al., 1998). In the context of this research, LSA enables the proposed algorithm to find similar morphological noun groups as well as to mine morphological solutions in documents.

LSA utilizes singular value decomposition (SVD) to extract relations between documents based on words. Research has demonstrated the close correlation between SVD categorization and human judgment-based categorization (Laham & Foltz, 1998). LSA can directly use term–document vectors and apply SVD to those vectors to find the relation between terms and documents. However, for the purpose of improving accuracy, in many cases, a weight function is applied to term–document vectors before SVD is applied. A co-occurrence matrix is transformed by weight functions before applying SVD in the LSA process (Nakov et al., 2001).

The weight function is equivalent to the product of local and global weight functions. Local weight functions signify the importance of a word to a document, and global weight functions reflect the importance of a word to a collection of documents. Thus, the weight function w(i, j) can be expressed as the following (Nakov et al., 2001):

$$w(i, j) = L(i, j)\,G(i),$$

where L(i, j) refers to the local weight function and G(i) refers to the global weight function (Nakov et al., 2001).

Various weighting schemes have been tested in previous studies (Salton & Buckley, 1988; Nakov et al., 2001). We use a combination of logarithmic term frequency (TF) and inverse document frequency (IDF), as this approach has shown acceptable accuracy in previous studies (Nakov et al., 2001; Wild et al., 2005). The weight function we use is represented as

$$w(i, j) = \log\left(\mathrm{TF}(i, j) + 1\right) \times \log\frac{N}{df(i)},$$

where N is the total number of documents and df(i) is the number of documents that contain term i.
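As an illustration, the log-TF and IDF weighting can be sketched in Python. This is a minimal sketch, not the authors' implementation: the function name `weight_matrix`, the toy counts, and the use of the natural logarithm are assumptions made for the example.

```python
import math

def weight_matrix(tf):
    """tf[i][j] is the raw count of term i in document j; returns w(i, j)."""
    n_docs = len(tf[0])
    weighted = []
    for row in tf:
        df = sum(1 for count in row if count > 0)   # documents containing term i
        idf = math.log(n_docs / df) if df else 0.0  # global weight G(i)
        # local weight L(i, j) = log(TF + 1), multiplied by G(i)
        weighted.append([math.log(count + 1) * idf for count in row])
    return weighted

# Toy counts (hypothetical): 3 terms (rows) in 3 documents (columns).
tf_counts = [
    [2, 0, 1],  # term present in two of three documents
    [1, 1, 1],  # term present everywhere, so IDF = log(1) = 0
    [0, 3, 0],  # term concentrated in a single document
]
w = weight_matrix(tf_counts)
```

A term that occurs in every document receives zero weight, while a term concentrated in few documents is emphasized.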

2.3. Expectation-maximization (EM) algorithm

Considering that LSA performs only a light clustering, the EM algorithm is applied to the truncated term vectors from SVD. K-means clustering can be used as an initialization step in the EM algorithm. K-means clustering generates random seed centroids for the clusters and then iterates between assigning data points and recalculating the cluster centroids. In this iterative process, each data point is assigned to the nearest centroid, which minimizes the residual sum of squares, defined as $\sum_{i = 1}^{k} \sum_{\vec{x} \in S_i} \Vert \vec{x} - \mu_i \Vert^2$. Here, $\vec{x}$, $\mu_i$, and $S_i$ refer to a data vector, the mean of the ith cluster, and the set of points in the ith cluster, respectively. Then, each centroid is recalculated as the mean of the data points in its set. The process repeats until the stop criterion is met (Manning et al., 2008).

The EM algorithm (Dempster et al., 1977) is used as the main clustering algorithm because of its superior performance over K-means. Briefly, the EM algorithm estimates hidden or missing parameters by maximizing the likelihood of a given data set. The EM algorithm consists of two steps: the expectation step (E-step) and the maximization step (M-step). The E-step estimates the expected hidden variables from the current model. The M-step finds the new model that maximizes the log-likelihood given the data and the currently estimated hidden variables. These steps iteratively update the model, and the process terminates when the log-likelihood saturates. The convergence threshold is set to 1e-06, the value used in the standard Natural Language Toolkit (NLTK) module in Python. In clustering, the hidden variable is usually the cluster membership of each data point (vector), and the algorithm returns membership weights that indicate how likely each term vector is to belong to each cluster. This research also uses the Gaussian mixture model for the EM algorithm. The Gaussian mixture model assumes that the data are drawn from k Gaussian distributions and determines the mean and covariance of each.
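The E-step and M-step can be sketched for a one-dimensional Gaussian mixture as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the data are scalar rather than truncated term vectors, the initial means are supplied by hand instead of by K-means, and the names `em_gmm_1d` and `gaussian_pdf` are hypothetical. Only the stopping threshold of 1e-06 is taken from the text.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, init_means, threshold=1e-6, max_iter=500):
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    mix = [1.0 / k] * k          # mixing proportions
    prev_ll = -math.inf
    resp = []
    for _ in range(max_iter):
        # E-step: membership weight of every point in every cluster
        resp, ll = [], 0.0
        for x in data:
            joint = [mix[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        if ll - prev_ll < threshold:   # log-likelihood has saturated
            break
        prev_ll = ll
        # M-step: re-estimate parameters from the membership weights
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, 1e-6)  # floor keeps the density finite
            mix[c] = n_c / len(data)
    return means, variances, resp

data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]           # toy data with two groups
means, variances, resp = em_gmm_1d(data, init_means=[0.0, 6.0])
```

The returned `resp` rows are the membership weights: each row sums to 1 and indicates how strongly a data point belongs to each Gaussian component.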

K-means and EM clustering both need a predetermined number of clusters. To choose the number of clusters, the gap statistic is applied (Tibshirani et al., 2001). The gap statistic measures the compactness of clusters using the within-cluster sum of squares function W_k, which is defined as

$$W_k = \sum_{r = 1}^{k} \frac{1}{2\vert C_r \vert} D_r = \sum_{r = 1}^{k} \sum_{i \in C_r} \Vert x_i - \mu_r \Vert^2,$$

where D_r is the sum of pairwise squared Euclidean distances over all points in cluster r, and C_r, x_i, and μ_r refer to the set of indices of the observations in cluster r, the ith data point, and the mean of cluster r, respectively. The gap statistic finds the value k that maximizes the gap defined as

$${\rm Gap}(k) = E{\ast}({\rm log}\,W_k)-{\rm log}\, W_k,$$

where E* denotes the expectation of a sample from the reference distribution, which is the null hypothesis of random noise.
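The following sketch computes the within-cluster dispersion in two ways and evaluates the gap for one candidate clustering. It assumes squared Euclidean distances and a pairwise sum taken over all ordered pairs, in which case the pairwise sum divided by twice the cluster size equals the sum of squared distances to the cluster mean. The function names and the reference dispersions are made up for the example; in practice the reference values would come from clustering data sets drawn from the null reference distribution.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def w_pairwise(clusters):
    """Sum over clusters of D_r / (2 |C_r|), with D_r over all ordered pairs."""
    return sum(
        sum(sq_dist(a, b) for a in pts for b in pts) / (2 * len(pts))
        for pts in clusters
    )

def w_centroid(clusters):
    """Sum over clusters of squared distances to the cluster mean."""
    total = 0.0
    for pts in clusters:
        mean = [sum(coord) / len(pts) for coord in zip(*pts)]
        total += sum(sq_dist(p, mean) for p in pts)
    return total

def gap(w_k, reference_w):
    """Gap(k) = E*[log W_k] - log W_k, with E* estimated by a sample mean."""
    expected = sum(math.log(w) for w in reference_w) / len(reference_w)
    return expected - math.log(w_k)

clusters = [[(0.0, 0.0), (2.0, 0.0)],
            [(10.0, 10.0), (10.0, 12.0), (10.0, 14.0)]]
w_k = w_centroid(clusters)
# Hypothetical dispersions standing in for reference (null) data sets.
gap_value = gap(w_k, reference_w=[40.0, 55.0, 48.0])
```

Repeating the calculation for each candidate k and keeping the k with the largest gap gives the number of clusters.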

2.4. Dimension selection in LSA

Determining the dimensions of SVD is an important factor in LSA. Reducing dimensions can eliminate noise but also exclude potentially useful data. The appropriate number of singular values is heavily studied in different areas and remains an area of debate and discovery (Landauer et al., 1998; Bingham & Mannila, 2001; Globerson & Tishby, 2003). Reduced dimensions between 200 and 600 are widely accepted for large corpora but not for small corpora (Villalon & Calvo, 2009). Our corpus is small and requires a different method for determining acceptable dimensions for the SVD.

The common method to determine acceptable dimensions for small corpora is searching for a sudden “elbow” of the eigenvalues on plots similar to the plot shown in Figure 1. If there is a sudden drop in value between two eigenvalues as one moves along the diagonal, this drop can be interpreted as the transition of important eigenvalues to unimportant eigenvalues. However, the eigenvalues for the documents used in this research tend to show an almost linear decrease, as shown in Figure 1. A clear, sudden drop in eigenvalues did not occur. Therefore, another method must be used to find the threshold of eigenvalues for the SVD.

Fig. 1. Example eigenvalues of a diagonal matrix when the penalty ratio is α = 0.7.

Two common methods are tested to find the proper dimensions in the SVD process: percent variance and profile likelihood (Jolliffe, 2002; Zhu & Ghodsi, 2006). These methods were introduced for principal component analysis (PCA). Because PCA and LSA both rely on SVD and both must choose how many dimensions to retain, the methods used to find the dimensionality in PCA are applied here to find the dimensionality of SVD in LSA. The percent variance method and the profile likelihood method are discussed below.

1. Percent variance method: If we let the singular values in the diagonal matrix in LSA be {s₁, s₂, s₃, … , s_n}, the percent variance method retains the first k eigenvalues, where k satisfies the following equation:
$$\frac{s_1 + s_2 + s_3 + \ldots + s_k}{\sum_{i = 1}^{n} s_i} \ge \xi.$$

Typically, ξ is set to 70%, 80%, or 90%. However, researchers have shown that a ξ value of 60% produced reasonably good results when automatically scoring essays using LSA (Wild et al., 2005). We will test ξ values on a set of {10%, 20%, … , 90%}.
2. Profile likelihood method: Researchers have used the likelihood function to find the “elbow” that optimally separates the group of eigenvalues before the “elbow” from the group of eigenvalues after it (Zhu & Ghodsi, 2006). This method divides the eigenvalues into two groups: $\mathcal{G}_1 = \{s_1, s_2, s_3, \ldots, s_k\}$ and $\mathcal{G}_2 = \{s_{k + 1}, s_{k + 2}, \ldots, s_n\}$. The profile log-likelihood for k, $\mathcal{L}_k(k)$, can be written as shown below.

$$\mathcal{L}_k(k) = \sum_{j = 1}^{k} \log f_1(s_j;\ \mu_1, \sigma) + \sum_{j = k + 1}^{n} \log f_2(s_j;\ \mu_2, \sigma),$$

$$f_j(s;\ \mu_j, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[-\frac{(s - \mu_j)^2}{2\sigma^2}\right] \quad \text{for } j = 1, 2,$$

$$\mu_1 = \frac{\sum_{s_i \in \mathcal{G}_1} s_i}{k},$$

$$\mu_2 = \frac{\sum_{s_i \in \mathcal{G}_2} s_i}{n - k},$$

$$\sigma^2 = \frac{(k - 1)\,\sigma_1^2 + (n - k - 1)\,\sigma_2^2}{n - 2}.$$

In addition, σ₁² and σ₂² are the variances of $\mathcal{G}_1$ and $\mathcal{G}_2$. The objective of the profile likelihood method is to find the k value that satisfies the objective function

$$\mathop{\mathrm{argmax}}\nolimits_{p = 1, 2, \ldots, n} \mathcal{L}_k(p)$$
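Both dimension-selection rules can be sketched in a few lines of Python. The sketch below makes several assumptions that are not stated in the text: the percent variance rule returns the smallest k that reaches the threshold, the common variance is pooled from the squared deviations of both groups, the candidate k is restricted to 1 through n − 1 so that neither group is empty, and the function names and singular values are invented for the example.

```python
import math

def percent_variance_k(singular_values, xi):
    """Smallest k whose leading values reach the fraction xi of the total."""
    total = sum(singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s
        if running / total >= xi:
            return k
    return len(singular_values)

def profile_log_likelihood(singular_values, k):
    """Log-likelihood of splitting the values into the first k and the rest."""
    n = len(singular_values)
    g1, g2 = singular_values[:k], singular_values[k:]
    mu1, mu2 = sum(g1) / len(g1), sum(g2) / len(g2)
    ss = sum((s - mu1) ** 2 for s in g1) + sum((s - mu2) ** 2 for s in g2)
    var = max(ss / (n - 2), 1e-12)   # pooled variance, floored above zero

    def log_f(s, mu):
        return -0.5 * math.log(2 * math.pi * var) - (s - mu) ** 2 / (2 * var)

    return sum(log_f(s, mu1) for s in g1) + sum(log_f(s, mu2) for s in g2)

def profile_likelihood_k(singular_values):
    n = len(singular_values)
    return max(range(1, n), key=lambda k: profile_log_likelihood(singular_values, k))

s = [10.0, 9.5, 9.0, 1.0, 0.8, 0.5]   # toy values with a clear elbow after the third
k_variance = percent_variance_k(s, xi=0.9)
k_profile = profile_likelihood_k(s)
```

With a clear elbow, as in this toy input, the two rules agree. For a nearly linear decay such as the one in Figure 1, they can return quite different dimensions.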

3. ALGORITHM

This section illustrates how our algorithm finds morphologies in biological texts and clusters these texts according to the found morphologies. Pseudocode is provided in the appendix. Figure 2 shows a schematic of the entire computational process. Each computational step is discussed in the following sections.

Fig. 2. Computational process of the developed clustering algorithm.

3.1. Step 1: Preprocessing text and searching functional terms

Step 1 requires preprocessing the biological text file. We consider a paragraph as the unit of a document. Thus, the text file is initially divided into paragraphs. Paragraphs are used for passage sampling because the specific sentence containing the functional term may not have the associated morphological description. By searching the entire paragraph, we increase the chances of finding a clear morphological description. Then the paragraph is parsed using TreeTagger, which is a sentence parsing program that finds a stem word (lemma) and tags a part of speech (POS) for every word in the paragraph (Schmid, 2013).

Step 1 also requires translation and expansion of the input keyword. The knowledge gap that exists between the engineering and biology domains creates a lexical gap between domains as well. The engineering-to-biology (E-to-B) thesaurus provides a translation of engineering functional terms to biological functional terms (Nagel et al., 2010). Therefore, the E-to-B thesaurus is used to translate the input keyword into its corresponding biological keywords and to expand it into all synonyms listed in the thesaurus. In this case, the initial search keywords are functions as defined by the functional basis (Little et al., 1997; Hirtz et al., 2002). The translated terms are from the thesaurus as reported by Cheong et al. (2011) and Nagel et al. (2010).

Once the text is preprocessed and the input keyword is translated and expanded, the algorithm retains only the paragraphs that contain functional keywords. These keywords must be used in a verb form, which includes base form, past tense, gerund, past participle, present singular, and present plural, based on TreeTagger POS tags.
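The paragraph filter can be sketched as follows. The sketch assumes each paragraph has already been tagged and is available as a list of (word, POS tag, lemma) triples; it does not call TreeTagger itself. The verb tag names are assumed to follow TreeTagger's English tagset, and the sample paragraphs and the name `contains_functional_verb` are hypothetical.

```python
# Verb tags assumed from TreeTagger's English tagset: base form, past tense,
# gerund/participle, past participle, present non-3rd person, present 3rd person.
VERB_TAGS = {"VV", "VVD", "VVG", "VVN", "VVP", "VVZ"}

def contains_functional_verb(tagged_paragraph, keywords):
    """True if any token is a verb whose lemma is one of the functional keywords."""
    return any(
        lemma in keywords and tag in VERB_TAGS
        for _word, tag, lemma in tagged_paragraph
    )

# Hypothetical pre-tagged paragraphs: (word, POS tag, lemma).
paragraph_a = [("Some", "DT", "some"), ("drugs", "NNS", "drug"),
               ("inhibited", "VVD", "inhibit"), ("the", "DT", "the"),
               ("enzyme", "NN", "enzyme")]
paragraph_b = [("Inhibition", "NN", "inhibition"), ("is", "VBZ", "be"),
               ("reversible", "JJ", "reversible")]

keywords = {"inhibit"}   # translated and expanded keyword set (toy example)
kept = [p for p in (paragraph_a, paragraph_b) if contains_functional_verb(p, keywords)]
```

Matching on the lemma rather than the surface word is what lets a single keyword cover all of its inflected verb forms, while the tag check excludes noun uses such as "inhibition".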

3.2. Step 2: Collect morphological nouns in filtered paragraphs

Step 2 identifies morphological nouns using the following WordNet noun categories: artifact, attribute, body, and shape. Definitions and examples of these noun categories are shown in Table 1. WordNet features numerous noun categories, but the specific categories shown in Table 1 are chosen based on the notion of morphology as it is most relevant to conceptual design.

Table 1. Noun categories in WordNet

Morphology consists of two parts: shape and structure. Thus, the shape noun category is selected. The categories representing structure are chosen using the WordNet definition of structure, shown in Figure 3. In searching for morphological descriptions, we are interested in both senses of structure given in Figure 3: “a thing constructed” and “the manner of construction of something.” Thus, the artifact noun and attribute noun categories are selected. The final category, body noun, is selected because natural components, such as a tubule or a tooth, are usually expressed using body nouns instead of artifact nouns. All remaining noun categories, such as the cognition noun and group noun categories shown in Figure 3, are not considered morphological nouns because they are not related to the definition of morphology.

Fig. 3. Decomposition of morphology definition: shape and structure. Categories representing “structure” are selected based on noun categories related to the definition of structure in WordNet.

TreeTagger POS tags are used to collect morphological nouns. However, some morphologies are expressed as adjectives. For example, “hexagonal” in “hexagonal shape” and “square” in “square pads” both indicate morphologies, but a noun-only search would miss these terms because they are tagged as adjectives. Thus, the lemma of an adjective is also examined through WordNet to determine if it is a morphological expression.
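A minimal sketch of the category check is given below. To stay self-contained it uses a small hand-written lexicon in place of WordNet; the lexicon entries are illustrative and were not verified against WordNet. With NLTK installed, the same category names could be obtained from `wordnet.synsets(lemma)` and `Synset.lexname()`. The function name and sample tokens are hypothetical.

```python
MORPHOLOGY_CATEGORIES = {"noun.artifact", "noun.attribute", "noun.body", "noun.shape"}

# Hypothetical stand-in for a WordNet lookup: lemma -> noun categories.
TOY_LEXICON = {
    "tubule": {"noun.body"},
    "square": {"noun.shape", "noun.artifact"},
    "idea": {"noun.cognition"},
    "colony": {"noun.group"},
}

def morphological_terms(tagged_tokens, lexicon=TOY_LEXICON):
    """Collect lemmas of nouns and adjectives that fall in a morphology category."""
    found = []
    for _word, tag, lemma in tagged_tokens:
        is_noun_or_adjective = tag.startswith("NN") or tag.startswith("JJ")
        if is_noun_or_adjective and lexicon.get(lemma, set()) & MORPHOLOGY_CATEGORIES:
            found.append(lemma)
    return found

tokens = [("square", "JJ", "square"), ("pads", "NNS", "pad"),
          ("surround", "VVP", "surround"), ("tubules", "NNS", "tubule"),
          ("colony", "NN", "colony")]
terms = morphological_terms(tokens)
```

The adjective "square" is kept because its lemma maps to a shape category, while "colony" is dropped because it maps only to the group category.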

3.3. Step 3: Generate term–document matrix and apply weight functions

Step 3 produces a term–document co-occurrence matrix. The rows of this matrix represent the morphological nouns collected in Step 2, and the columns represent the paragraphs filtered in Step 1. Each matrix entry indicates the frequency of its morphological noun (row) in its paragraph (column).

After the co-occurrence matrix is generated, a penalty function is applied to the term–document matrix. The penalty function helps the algorithm find morphological nouns that are related to the functional keyword of interest. The penalty function is different from the weight function. The weight function determines term importance in a passage and in the entire collection of passages, based on term frequency. The penalty function operates locally in a passage and finds morphological terms that describe the functional keyword. The morphologies that are closely related to the searched keyword are often located nearby in the passage, as shown in Figure 4.

Fig. 4. Importance of morphological terms (organization and capillaries) according to the distance away from a functional keyword (surround).

To increase the importance of the morphological nouns closer to the functional verb, we develop and apply a penalty function to the term–document matrix. The penalty function λ is expressed as

$$\lambda(i, j) = \alpha^{\delta} \times \mathrm{TF}(i, j),$$

where α is a weight less than 1 and TF is the term frequency. The symbol δ indicates the difference in sentence order between the sentence that contains the morphological noun and the sentence that contains the functional verb. Using Figure 4 to illustrate, the morphological term “organization” has a δ value of 3, and “tubule” has a δ value of 0 because “tubule” is in the same sentence as the functional verb “surround.” If there are multiple functional keywords within a passage, the average δ value of each morphological term will be used. Including the penalty function modifies the weight function, which is applied to a term–document matrix before performing SVD. The modified weight function is represented by

$$\begin{aligned}
w(i, j) &= \log\left(\lambda(i, j) + 1\right) \times \log\frac{N}{df(i)} \\
&= \log\left(\alpha^{\delta} \times \mathrm{TF}(i, j) + 1\right) \times \log\frac{N}{df(i)}.
\end{aligned}$$
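The effect of the penalty can be illustrated with a short calculation. The sketch below assumes natural logarithms and treats δ as the absolute difference in sentence index, averaged over the keyword sentences. The document frequency of 5 and the function names are made up for the example; the value α = 0.7 matches Figure 1 and the collection size of 46 paragraphs matches the evaluation reported later.

```python
import math

def mean_sentence_distance(term_sentence, keyword_sentences):
    """Average delta when a passage contains several functional keywords."""
    return sum(abs(term_sentence - s) for s in keyword_sentences) / len(keyword_sentences)

def penalized_weight(tf, delta, alpha, n_docs, doc_freq):
    """w(i, j) = log(alpha**delta * TF + 1) * log(N / df)."""
    penalized_tf = (alpha ** delta) * tf           # lambda(i, j)
    return math.log(penalized_tf + 1) * math.log(n_docs / doc_freq)

alpha, n_docs, doc_freq = 0.7, 46, 5               # doc_freq is hypothetical

# Same term frequency, different distance from the functional verb.
near = penalized_weight(tf=1, delta=0, alpha=alpha, n_docs=n_docs, doc_freq=doc_freq)
far = penalized_weight(tf=1, delta=3, alpha=alpha, n_docs=n_docs, doc_freq=doc_freq)

# A term in sentence 4 with functional verbs in sentences 1 and 3.
avg_delta = mean_sentence_distance(4, [1, 3])
```

A term in the same sentence as the functional verb keeps its full term frequency, whereas a term three sentences away is scaled by 0.7³ ≈ 0.343 before the logarithm is taken.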

3.4. Step 4: Apply LSA to morphology vector

The main objective of LSA is clustering similar morphological solutions. LSA can discover latent relations between words by measuring the frequency of co-occurring words in each document. For example, LSA can determine similarities between the words “apple” and “orange,” even if these words are not in the same document, by their shared words in each document, such as “tree” or “juice.” Likewise, LSA can conclude that two morphological nouns, such as “pill” and “drug,” are similar, even if they are not in the same passage.

After the dimensionality, k, is determined, LSA maps the morphology vectors and document (paragraph) vectors to the k-dimensional vector spaces. Through this process, the latent relation (similarity or dissimilarity) between morphological nouns can be discovered, and the algorithm can prepare to apply a clustering algorithm to the truncated morphological vectors.
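The mapping to a k-dimensional space can be sketched with NumPy's SVD routine. This is an illustrative sketch under assumptions: the weighted matrix is a toy example, the term and document vectors are scaled by the singular values (one common convention, not necessarily the one used in this work), and the helper names are hypothetical.

```python
import numpy as np

def truncate_svd(weighted, k):
    """Return k-dimensional term vectors, document vectors, and the rank-k matrix."""
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    term_vectors = u[:, :k] * s[:k]          # one row per morphological term
    doc_vectors = vt[:k, :].T * s[:k]        # one row per paragraph
    approx = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
    return term_vectors, doc_vectors, approx, s

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy weighted term-document matrix: 4 terms (rows), 3 paragraphs (columns).
weighted = np.array([[1.0, 1.0, 0.0],
                     [2.0, 2.0, 0.0],
                     [0.0, 0.0, 3.0],
                     [0.0, 1.0, 3.0]])
term_vectors, doc_vectors, approx, s = truncate_svd(weighted, k=2)
similarity = cosine(term_vectors[0], term_vectors[1])
```

The first two terms have the same distribution over paragraphs up to scale, so their truncated vectors point in the same direction. The truncated term vectors are the input to the clustering step.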

3.5. Step 5: Clustering morphological nouns by EM algorithm

The EM algorithm is used because LSA does not actually group words or documents. Rather, LSA provides the relatedness of word-to-word, document-to-document, and word-to-document. Our algorithm categorizes morphological nouns by applying the EM algorithm to the truncated morphological vectors. The details of the EM algorithm have been discussed previously.

3.6. Step 6: Cluster passages based on morphology groups

Step 6 groups passages according to the morphological clusters generated in Step 5. The passages that have more than p common morphologies from one morphological cluster are grouped as a passage cluster. Table 2 provides definitions of the nomenclature.

Table 2. Definitions of nomenclature

Let us assume the algorithm has q morphology clusters. If passage 1 and passage 2 both have more than p morphologies from the fourth morphological cluster, they are grouped together as belonging to the fourth passage cluster. Consequently, if there are q morphology clusters, the number of passage clusters is also q. Some passages are not included in any passage cluster and will be provided separately.

This method of passage clustering indicates that one passage can have multiple morphological solutions. The number of morphological solution groups contained in one passage is unknown. Determining the quantity of morphological solutions that can be extracted from one passage is complicated with LSA and EM algorithms. However, we can assign multiple solution groups to one passage easily with the grouping scheme presented, even if we do not have a predetermined number of morphological solution groups.

Determining the p value requires a compromise between the number of overlapped passages and the number of passages that are not clustered. Let T represent the summation of the number of passages in all passage clusters and A represent the number of unique passages. When p is too small, T might be too large, which indicates an engineer would have to read too many overlapped passages. Conversely, if p is too large, the number of passages not included in any of the passage clusters will increase, which is also not desirable. Thus, the p value should be determined to make T larger than A, but also to minimize the difference between T and A. Usually, p has a value of 5 or 6 for a paragraph-unit document.
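The grouping rule and the quantities T and A can be sketched as follows. The sketch assumes that a passage joins a passage cluster when it shares more than p distinct morphologies with the corresponding morphology cluster, and it reads A as the number of distinct passages that appear in at least one passage cluster. The toy passages, the small threshold p = 1, and the function names are hypothetical.

```python
def cluster_passages(passage_terms, morphology_clusters, p):
    """Assign each passage to every cluster with which it shares more than p terms."""
    passage_clusters = {cid: [] for cid in morphology_clusters}
    for pid, terms in passage_terms.items():
        for cid, members in morphology_clusters.items():
            if len(set(terms) & set(members)) > p:
                passage_clusters[cid].append(pid)
    return passage_clusters

def overlap_summary(passage_clusters, all_passages):
    """Return T (total memberships), A (distinct clustered passages), and leftovers."""
    t = sum(len(members) for members in passage_clusters.values())
    clustered = set().union(*passage_clusters.values())
    unclustered = sorted(set(all_passages) - clustered)
    return t, len(clustered), unclustered

morphology_clusters = {0: {"tubule", "capillary", "vessel"},
                       1: {"shell", "plate", "spine"}}
passage_terms = {"P1": {"tubule", "capillary", "wall"},
                 "P2": {"tubule", "capillary", "shell", "plate"},
                 "P3": {"shell", "plate", "spine"},
                 "P4": {"membrane"}}

groups = cluster_passages(passage_terms, morphology_clusters, p=1)
t, a, unclustered = overlap_summary(groups, passage_terms)
```

In this toy case one passage belongs to both clusters, so T exceeds A by one, and one passage falls outside every cluster and would be reported separately.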

The morphologies in each passage in a passage cluster are aligned according to their importance using the truncated matrix generated by LSA. The value at the intersection of a column and a row in the truncated matrix indicates the degree of relatedness of the row word in the column document. A value closer to 1 indicates that the term is more relevant to the passage. The morphological noun that has a negative value in the truncated matrix is excluded, because the negative value indicates poor relatedness of a document and the functional term. The sorted morphologies in a passage are then highlighted, as shown in Figure 5.

Fig. 5. Sample screenshot of the algorithm output. Morphological terms are highlighted, and the searched keyword is underlined.

4. ANALYSIS OF RESULTS

This section examines how well the developed algorithm categorizes passages based on important morphologies. The algorithm performance is evaluated through a comparison of morphologies extracted by the algorithm and morphologies manually identified. Two annotators manually established what we term the gold standard: the morphologies that both annotators agreed were the form and structure used by the natural system to execute the function identified by the functional keyword in a passage.

The text corpus used for this evaluation is from the biological textbook Life: The Science of Biology (Purves et al., 2003). Here, we illustrate an example with the input keyword “inhibit.” The search retrieves 46 paragraphs, which are categorized according to morphological solutions. The minimum number of morphologies shared by passages in the same passage cluster, p, is set to 5 to ensure a uniform test condition. In addition, the system is set to retrieve the top 10 most important morphological nouns in a passage per morphological cluster. Precision, recall, and F1 score (Manning & Schütze, 1999) are used to evaluate the system and are defined as follows:

$$\begin{aligned}
\text{Precision} &= \frac{\text{Relevant retrieved documents}}{\text{Retrieved documents}} \\
\text{Recall} &= \frac{\text{Relevant retrieved documents}}{\text{Relevant documents}} \\
F1\ \text{score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}$$
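
For concreteness, the three measures can be computed from two sets of terms as in the following Python sketch. The function name and the example terms are hypothetical; in the evaluation, the retrieved set would hold the morphologies extracted by the algorithm and the relevant set would hold the gold standard.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two collections of terms."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant retrieved items

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; defined as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: four extracted nouns, three gold-standard morphologies.
p, r, f1 = precision_recall_f1(
    retrieved=["bone", "shaft", "cell", "limb"],
    relevant=["bone", "limb", "marrow"],
)
```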

Two separate methods are tested to find the proper dimensions for the LSA process: the percent variance method and the profile likelihood method. The details of these two methods have been discussed previously. In addition, the effectiveness of the α value in the penalty function is evaluated from 0.1 to 1.0 in 0.1 increments. If α = 1.0, the penalty function is not applied.
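
As a rough illustration of the percent variance method, the sketch below picks the smallest number of dimensions whose singular values account for a requested fraction of the total variance. Two points are assumptions of this sketch rather than details taken from the paper: variance is measured by the squared singular values, and the function name `dims_by_percent_variance` is hypothetical. The list `ALPHAS` simply reproduces the evaluated grid of α values.

```python
import numpy as np

def dims_by_percent_variance(singular_values, fraction):
    """Smallest k such that the first k singular values reach `fraction` of the variance."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    s = np.asarray(singular_values, dtype=float)
    # Assumption: variance explained is proportional to the squared singular values.
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, fraction)) + 1
    return min(k, s.size)  # guard against floating-point round-off at fraction = 1.0


# Penalty ratios evaluated in the text: 0.1 to 1.0 in 0.1 increments.
ALPHAS = [round(0.1 * i, 1) for i in range(1, 11)]
```

A full sweep would then loop over the chosen variance levels and over `ALPHAS`, recording precision, recall, and F1 score for each pair.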

4.1. Percent variance method

The results for the percent variance test method are shown in Tables 3–5. The precision appears low because our system selects 10 morphological nouns from each passage per morphological cluster. For example, if three morphological solution groups exist in one passage, the system can extract up to 30 morphological nouns from that passage. By contrast, the gold standard contains an average of only 4.782 morphologies per passage, regardless of the number of morphological solutions, which lowers the precision values.
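
A back-of-the-envelope bound makes this effect concrete. Assume, for illustration only, that a passage belongs to $c$ morphological clusters, that the system returns its full quota of $10c$ nouns, and that the passage has the average number of gold-standard morphologies, $g = 4.782$. Even if every gold-standard morphology were retrieved, precision could not exceed

$$\text{Precision}_{\max} = \frac{g}{10c}, \qquad \frac{4.782}{10} \approx 0.478 \ (c = 1), \qquad \frac{4.782}{30} \approx 0.159 \ (c = 3).$$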

Table 3. Precision for various percent variance dimensions and penalty ratio α

Note: The shaded cells indicate the values that have better results than the value that does not apply the penalty function, and the bold values indicate the best values in each dimension reduction.

Table 4. Recall for various percent variance dimensions and penalty ratio α

Table 5. F1-score for various percent variance dimensions and penalty ratio α

The precision and F1 score appear unaffected by the number of dimensions. Even though more correct answers can be retrieved as the dimension increases, the total number of morphological nouns extracted by the system also increases, which offsets the effect of the additional correct answers. Similarly, precision appears largely unaffected by the penalty rate, α. The total number of retrieved solutions may be so large that it pushes the precision values down across all penalty rates.

The recall results indicate trends according to the dimension reduction and the penalty rate. The recall in low dimensions (10%, 20%, or 30%) is lower than the recall in high dimensions. This trend is reasonable considering that significant dimension reduction can cause loss of important information. The more important trend relates to the penalty rate, α. In most cases, as the penalty rate increases, the recall gradually increases. However, recall begins to decrease at a certain point, which indicates that there is an optimal penalty rate if we fix the dimensionality.

The bold values in Tables 3–5 are the best values for each dimension reduction. The correlation between the penalty rate and the best precision value or the best F1 score is weak. However, the correlation between the penalty rate and the best recall value is more apparent. The best recall occurs at a penalty rate of 0.7 and a percent variance of 90%. The next best recall value occurs at a penalty rate of 0.8 and a percent variance of 60%.

The shaded cells in Tables 3–5 are the values that have better results than the value that does not apply the penalty function (α = 1). As the dimensions in SVD increase, the α value should also increase to obtain a better result for the weighting function that properly reflects the effect of morphological nouns that appear close to the functional keyword verb.
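
To make the role of α tangible, the following Python sketch shows one possible distance-based penalty. The exact form of the penalty function is defined earlier in the paper and is not reproduced in this section, so the geometric decay `alpha ** distance`, the function name, and the notion of distance used here (for example, the number of sentences between a noun and the functional keyword) are all assumptions of this sketch. The form is chosen only because it matches two properties stated in the text: α = 1 leaves the weights unchanged, and smaller α values favor nouns close to the keyword.

```python
def penalized_weight(base_weight, distance, alpha):
    """Scale a term weight by alpha ** distance (assumed, illustrative penalty form).

    base_weight : weight of the morphological noun before the penalty
    distance    : non-negative distance between the noun and the functional keyword
    alpha       : penalty ratio in (0, 1]; alpha = 1 applies no penalty
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return base_weight * alpha ** distance


# With alpha = 1.0 the weight is unchanged, whatever the distance.
unchanged = penalized_weight(2.0, distance=3, alpha=1.0)
# With a small alpha, a distant noun is strongly down-weighted.
damped = penalized_weight(1.0, distance=2, alpha=0.5)
```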

The results of various dimensions are evaluated using the percent variance method to find the proper α value. However, the correlation between α and dimensionality seems arbitrary in the above results. Therefore, the profile likelihood method is tested in the next case.

4.2. Profile likelihood method

The profile likelihood method is used to fix the dimensions optimally and determine the proper α. Figure 6 shows the precision, recall, and F1 score for the various α values. The baseline represents the result where α = 1.

Fig. 6. Result of applying profile likelihood method for singular value decomposition.

Applying the penalty function under the profile likelihood method increases the precision and F1 score. The precision and F1 score are the greatest when α = 0.6. The recall is reduced when applying a penalty function with low α values (i.e., α ≤ 0.4), which is likely due to the overemphasis on morphological nouns near the functional verb. The recall is significantly greater than the baseline, with α values ranging from 0.6 to 0.9. The results from the percent variance method also indicate the greatest recall occurs when α ranges from 0.6 to 0.9, regardless of the dimension reduction. Therefore, the recall results from the profile likelihood method agree with the recall results from the percent variance method.
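
For readers unfamiliar with the profile likelihood method (Zhu & Ghodsi, 2006), the idea can be sketched as follows: the ordered singular values are split into two groups at every candidate position q, each group is modeled as Gaussian with its own mean and a shared variance, and the q with the highest log-likelihood is selected. The code below is a simplified sketch written for this illustration, not the implementation used in the study; the function name is hypothetical, and only splits that leave both groups non-empty are considered.

```python
import numpy as np

def profile_likelihood_dim(values):
    """Pick the split q that maximizes a two-group Gaussian profile log-likelihood."""
    d = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    p = d.size
    if p < 3:
        raise ValueError("need at least three values")

    best_q, best_ll = 1, -np.inf
    for q in range(1, p):  # both groups stay non-empty
        head, tail = d[:q], d[q:]
        # Sum of squared deviations from each group's own mean.
        ss = np.sum((head - head.mean()) ** 2) + np.sum((tail - tail.mean()) ** 2)
        var = max(ss / (p - 2), 1e-12)  # pooled variance, floored for stability
        ll = -0.5 * p * np.log(2 * np.pi * var) - ss / (2 * var)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q


# Hypothetical singular values with a clear gap after the third one.
chosen = profile_likelihood_dim([10.0, 9.5, 9.0, 1.0, 0.8, 0.5, 0.2])
```

Unlike the percent variance method, this procedure needs no threshold, which is why it can be used to fix the dimensionality before α is varied.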

5. LIMITATIONS AND FUTURE WORK

Our proposed solution retrieval method and text clustering algorithm have limitations. These issues remain as challenges to be addressed before this method can be extended to bioinspired design for the practicing design engineer. The limitations discussed in this section remain areas for future work.

The results of the percent variance method and the profile likelihood method reveal some areas of improvement for the algorithm. The low precision values are due to the amount of noise in the retrieved morphological nouns. To alleviate noise and increase performance of the algorithm, the maximum number of retrieved nouns per passage must be controlled. Future work includes adopting machine learning algorithms or creating a filter for morphological nouns extracted using WordNet as two possible solutions for reducing noise. In addition, several factors diminish the recall value. The morphological nouns in the gold standard do not always appear in WordNet, because WordNet cannot recognize terminology that is rarely used in everyday English. Some gold-standard terms, such as “signal,” are nonmorphological, because the annotators tended to select keywords that are important for carrying out the function in the system, rather than focusing on morphology only. In many cases, a function can be achieved by a combination of various elements, including signals, morphologies, or other flows. Thus, the inability to retrieve such nonmorphological gold-standard terms can be viewed as a limitation of the morphology-based approach. Integrating an approach similar to the work of Vandevenne et al. (2014) on focus organism detection in biological strategy documents may be useful for improvement of the clustering algorithm. The taxonomic relations between morphological terms may expand the cluster term sets and improve recall. However, the proposed strategy lies beyond the scope of this paper and remains as future work.

The algorithm cannot differentiate between subject morphological words and object morphological words. For example, in a sentence such as “A senses B,” if A and B are both morphological words found by WordNet, the algorithm cannot recognize which morphology is actually conducting the function “sense.” In addition, the developed algorithm can only find morphological nouns. If engineers want to find other types of solutions, such as behavioral solutions, our proposed solution retrieval method and text clustering algorithm are not suitable for that task. We confine the form of morphological solutions to nouns. However, the WordNet noun categorization used in this study is not suitable for other types of solutions, which may require different definition confinement, parts of speech, and filtering standards.

6. CONCEPTUAL DESIGN EXAMPLE: DESIGN AN ANTI-IMPACT FABRIC

A case study is presented to demonstrate the utility of the morphology-based clustering algorithm in a conceptual design effort. The objective of this case study is to create a new conceptual design for shock-absorbing materials that has impact or force dispersion advantages over conventional protective materials such as foam or fiber pads. We employ the proposed morphology-based keyword search algorithm to identify a specific shock-absorbing morphology used in nature that provides inspiration for an engineered shock-absorbing fabric. This fabric is useful for various applications such as protective clothing, floor mats, or construction materials. Fabric dimensions and materials are not restricted considering the conceptual nature of this design exercise.

6.1. Design procedures

The case study follows the design steps using the developed keyword–morphology search tool. Design procedures are shown in Figure 7 to illustrate designer activities and computational activities. The designer activities are as follows. The designer draws a functional model for the design problem. The designer identifies a key function for the morphological search and enters the keyword in the developed algorithm. After the algorithm clusters the biological text, the designer reviews the morphology represented in each cluster. The designer selects the first passage in a cluster and reads it. If the selected passages are interesting to a designer, he or she continues to read and explore the specific morphology used by natural systems to provide the needed functionality. If passages in the cluster are not inspiring for the design problem, the designer skips the current passage cluster and moves on to another passage cluster related to the main function.

Fig. 7. Designer's activities and computational processes using the proposed categorization algorithm.

The black box model for the function “inhibit energy” and the functional model of an anti-impact fabric using the functional basis (Hirtz et al., 2002) are shown in Figure 8 and Figure 9, respectively. This functional keyword acts as input to the algorithm. In this case, the user chooses “inhibit” from the function “inhibit energy,” which is the design function of interest for the shock-absorbing material. Using the E-to-B thesaurus, the algorithm finds all paragraphs in the text that contain the keyword “inhibit” and its biological correspondents, including “cover,” “destroy,” “repress,” and “surround.”

Fig. 8. Black box model of a shock-absorbing material.

Fig. 9. Example of functional model using functional basis, expanding on the main function shown in Figure 8.
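
The keyword expansion and paragraph retrieval step described above can be sketched in Python as follows. The dictionary holds only the correspondents named in the text for “inhibit”; the actual E-to-B thesaurus may list more. The function name, the prefix-based matching (a crude stand-in for proper lemmatization), and the example paragraphs are assumptions made for this sketch.

```python
import re

# Only the correspondents quoted in the text; the real thesaurus entry may be longer.
E2B_THESAURUS = {"inhibit": ["cover", "destroy", "repress", "surround"]}

def retrieve_paragraphs(paragraphs, keyword, thesaurus=E2B_THESAURUS):
    """Return (index, paragraph) pairs that mention the keyword or a correspondent."""
    search_terms = [keyword] + thesaurus.get(keyword, [])
    # Match each term as a word prefix so that simple inflections
    # ("inhibits", "surrounded") are also found. This is an approximation.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in search_terms) + r")\w*",
        re.IGNORECASE,
    )
    return [(i, text) for i, text in enumerate(paragraphs) if pattern.search(text)]


# Hypothetical paragraphs, not taken from the corpus.
sample = [
    "Compact bone surrounds the marrow cavity.",
    "Leaves capture sunlight.",
    "The repressor protein inhibits transcription.",
]
matches = retrieve_paragraphs(sample, "inhibit")
```

The retrieved paragraphs would then be passed to the morphological noun filter and the clustering stage.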

After the algorithm collects the paragraphs that contain the key functional verbs, morphological nouns in the text are filtered and the collected morphological nouns and paragraphs are categorized. A portion of the results from the algorithm is presented in Figure 10. The output presents the passage clusters and passage numbers included in each passage cluster. Morphologies in each passage are also provided next to the passage number. After scanning a morphology list, a designer can be directly inspired by the morphological words or decide to read a passage cluster based on its morphologies.

Fig. 10. Portion of algorithm output showing the retrieval of important morphological keywords from each passage.

Passages from passage cluster 1 and passage cluster 4 are presented in Figure 11 and Figure 12, respectively. If one passage cluster contains few passages, this means that the morphological solutions in this passage cluster are shared by few passages. As a result, if an engineer wants relatively common morphological solutions, passage cluster 4 might be a good option. However, if an engineer wants more unique solutions, passage cluster 1 might be a better option than passage cluster 4.

Fig. 11. A paragraph in passage cluster 1 containing the functional keyword. Only one paragraph is contained in passage cluster #1 because of its unique morphologies in the text. Morphologies are highlighted and the functional keyword, “surround,” which is a correspondent biological keyword to “inhibit,” is underlined.

Fig. 12. Portions of example passages grouped by the system in passage cluster 4 among 46 paragraphs containing the functional keyword “inhibit.” Morphologies are highlighted and the functional keyword is underlined.

After reading through the retrieved morphologies, the designer is intrigued by passage cluster 1 and decides to read the entire passage (passage 44), shown in Figure 11. Reviewing the description in passage 44 of the morphology of bone, the designer is able to create analogies between the morphology of bone and the morphology for shock-absorbing materials. The designer is inspired by highlighted morphological keywords, such as compact bone, bone marrow, limb, shaft, and cylinder. The designer is not interested in the morphologies contained in passage cluster 4 and skips the passages shown in Figure 12. Despite not reading the passages in passage cluster 7, the designer is inspired by morphologies such as covering, membrane, mucus, layer, and plate in the results list shown in Figure 10. After the designer reviews several inspirational words and passages, a conceptual design sketch of the anti-impact fabric is generated as shown in Figure 13. The conceptual design of the shock-absorbing material is discussed in the next section.

Fig. 13. Sketch of a design inspired by morphological nouns found in the selected paragraphs in Table 6.

6.2. Redesigned anti-impact pad/fabric

The new design consists of four thin rubber layers with two layers being paired. Distributed rubber cylinders are located between the thin layer pair. Air fills the cavity of hollow cylinder layers.

It is noteworthy that the bioinspired concept for a shock-absorbing material presented in Figure 13 is similar to another bioinspired shock-absorbing design. Figure 14 illustrates a shock absorber inspired by a woodpecker (Yoon & Park, 2011). The woodpecker's head can endure impact at high velocities yet suffer no damage. While the design illustrated in Figure 14 is clearly a more defined concept than the case study design, both concepts use multiple layers, cavities, and damping elastic materials. The keyword and morphological search method proposed in this paper is used to find a similar system in nature for design inspiration. Of importance, the designer requires no familiarity with biological systems that provide shock absorption to design the concept. As such, the proposed algorithm allows designers with limited biological knowledge to search nature for solutions to design problems.

Fig. 14. Woodpecker head inspired shock-absorbing system (Marks, 2011; Yoon & Park, 2011).

In this conceptual design example, the keyword morphological search is applied to the design of a shock-absorbing material. The tool finds biological morphologies used to solve the function “inhibit” in nature and clusters them based on similar passages. Clustering allows engineers to review morphological strategies in a more organized and focused manner than if the passages were returned in a random or flat manner. Based on the clustered passages, the designer can review one passage and conclude that no further passages in that cluster are necessary for review. Finally, the case study design is compared to the existing bioinspired shock-absorbing design to validate the promise and potential of the proposed algorithm.

7. CONCLUSION

This research effort seeks to improve the text-based concept generation method in bioinspired design using various natural language processing and text mining theories. The research stems from the concept generation tool called keyword search, which bridges the engineering problem domain and the biological solution domain by mapping an engineering functional keyword to a biological correspondent. Though keyword search can convey potential text solutions to engineering designers, inconvenience still exists in this method, such as repeated solutions or a large volume of retrieved text. Our research addresses these drawbacks and delivers biological information effectively by categorizing biological text according to biological solutions within the text. We focus on morphological solutions because they increase the designer's understanding of biological systems and enlarge the engineer's design space, which contributes to creative design. WordNet's noun category list is used to find morphological solutions. LSA and the EM algorithm are adapted to perform the categorization of biological text. We also present a novel penalty function that penalizes a solution far from a functional keyword in a paragraph to reflect function–morphology relationships. The performance of our penalty function is superior to the conventional weight function in this context. The appropriate penalty rate should be between 0.6 and 0.9. The enhanced performance of the algorithm in this range of penalty rate verifies the importance of a function in the solution mining process and our assumption that the degree of function–solution relatedness must be considered in this work. Challenges and limitations of the current study are presented and remain as areas for future work. A case study for the design of shock-absorbing materials exemplifies the utility of the developed algorithm. Upon implementation of the morphology-based clustering algorithm, engineering designers can discover innovative solutions in biological text in a straightforward, efficient manner.

ACKNOWLEDGMENTS

Support for this project was provided by National Science Foundation Grant NSF EFRI-ODISSEI 1240483. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Sooyeon Lee received her Bachelor's degree in mechanical engineering from Korea University in 2010. She received her PhD degree in mechanical engineering from Texas A&M University in 2015 under Dr. Daniel A. McAdams. Dr. Lee's dissertation research focused on applying natural language processing to biological text for bioinspired design.

Daniel A. McAdams is a Professor of mechanical engineering in the Department of Mechanical Engineering and Graduate Program Director at Texas A&M University. He teaches undergraduate courses in design methods, biologically inspired design, and machine element design and graduate courses in product design and dynamics. Dr. McAdams's research interests are in the area of design theory and methodology with specific focus on functional modeling, innovation in concept synthesis, biologically inspired design methods, inclusive design, and technology evolution as applied to product design. He has edited a book on biologically inspired design.

Elissa Morris received her Bachelor's degree in mechanical engineering from the University of Texas at San Antonio in 2011. She is currently a PhD student at Texas A&M University under advisement of Dr. Daniel A. McAdams. Her research interests include bioinspired design for self-assembling systems, origami-inspired design, and engineering education.

APPENDIX A

Pseudocode of the algorithm

REFERENCES

Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: applications to image and text data. Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 245–250. San Francisco, CA: ACM.

Chakrabarti, A., Sarkar, P., Leelavathamma, B., & Nataraju, B. (2005). A functional representation for aiding biomimetic and artificial inspiration of new ideas. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 19(2), 113–132.

Cheong, H., Chiu, I., Shu, L., Stone, R., & McAdams, D. (2011). Biologically meaningful keywords for functional terms of the functional basis. Journal of Mechanical Design 133(2), 021007.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–38.

Gero, J.S., & Kazakov, V. (1999). Adapting evolutionary computing for exploration in creative designing. Proc. Computational Models of Creative Design IV, pp. 175–186. Sydney, Australia: University of Sydney, Key Centre of Design Computing and Cognition.

Globerson, A., & Tishby, N. (2003). Sufficient dimensionality reduction. Journal of Machine Learning Research 3, 1307–1331.

Hacco, E., & Shu, L. (2002). Biomimetic concept generation applied to design for remanufacture. Proc. ASME Design Engineering Technical Conf., Vol. 3, pp. 239–246. New York: ASME.

Hirtz, J., Stone, R.B., McAdams, D.A., Szykman, S., & Wood, K.L. (2002). A functional basis for engineering design: reconciling and evolving previous efforts. Research in Engineering Design 13(2), 65–82.

Jolliffe, I. (2002). Principal Component Analysis. New York: Wiley Online Library.

Kaiser, M., Hashemi Farzaneh, H., & Lindemann, U. (2014). Bioscrabble—the role of different types of search terms when searching for biological inspiration in biological research articles. Proc. DESIGN 2014 13th Int. Design Conf., Dubrovnik, Croatia, May 19–22.

Ke, J., Wallace, J., & Shu, L. (2009). Supporting biomimetic design through categorization of natural-language keyword-search results. Proc. ASME 2009 Int. Design Engineering Technical Conf./Computers and Information in Engineering Conf., pp. 775–784. San Diego, CA: ASME.

Laham, T., & Foltz, P. (1998). Learning human-like knowledge by singular value decomposition: a progress report. Advances in Neural Information Processing Systems 10: Proc. 1997 Conf., p. 45. Cambridge, MA: MIT Press.

Landauer, T.K., Foltz, P.W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes 25(2–3), 259–284.

Little, A., Wood, K., & McAdams, D. (1997). Functional analysis: a fundamental empirical study for reverse engineering, benchmarking, and redesign. Proc. 1997 Design Engineering Technical Conf., Paper No. 97-DETC/DTM-3879, Sacramento, CA.

Manning, C.D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.

Manning, C.D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

Marks, P. (2011). Woodpecker inspires shock absorbers. New Scientist 209(2798), 21.

Nagel, J.K., Stone, R.B., & McAdams, D.A. (2010). An engineering-to-biology thesaurus for engineering design. Proc. ASME 2010 Int. Design Engineering Technical Conf./Computers and Information in Engineering Conf., pp. 117–128. New York: ASME.

Nakov, P., Popova, A., & Mateev, P. (2001). Weight functions impact on LSA performance. Proc. EuroConference RANLP, pp. 187–193, Tzigov Chark, Bulgaria, September 5–7.

Purves, W.K., Orians, G.H., Sadava, D., & Heller, H.C. (2003). Life: The Science of Biology: Vol. 3. Plants and Animals. New York: Macmillan.

Qian, L., & Gero, J. (1992). A design support system using analogy. Proc. Artificial Intelligence in Design'92, pp. 795–813. New York: Springer.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523.

Schmid, H. (2013). Probabilistic part-of-speech tagging using decision trees. In New Methods in Language Processing (Jones, D.B., & Somers, H., Eds.), pp. 154–164. New York: Routledge.

Shu, L., Ueda, K., Chiu, I., & Cheong, H. (2011). Biologically inspired design. CIRP Annals—Manufacturing Technology 60(2), 673–693.

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2), 411–423.

Tsenn, J., McAdams, D.A., & Linsey, J.S. (2015). A comparison of mechanical engineering and biology students’ ideation and bioinspired design abilities. Proc. Design Computing and Cognition '14, pp. 645–662. New York: Springer.

Vandevenne, D., Verhaegen, P.-A., Dewulf, S., & Duflou, J.R. (2016). SEABIRD: scalable search for systematic biologically inspired design. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 30(1), 78–95.

Vandevenne, D., Verhaegen, P.-A., & Duflou, J.R. (2014). Mention and focus organism detection and their applications for scalable systematic bio-ideation tools. Journal of Mechanical Design 136(11), 111104.

Vattam, S., Wiltgen, B., Helms, M., Goel, A.K., & Yen, J. (2011). DANE: fostering creativity in and through biologically inspired design. Proc. Design Creativity 2010 (Taura, T., & Nagai, Y., Eds.), pp. 115–122. London: Springer.

Villalon, J., & Calvo, R.A. (2009). Single document semantic spaces. Proc. 8th Australasian Data Mining Conf., Vol. 101, pp. 175–181. Melbourne: Australian Computer Society.

Wild, F., Stahl, C., Stermsek, G., & Neumann, G. (2005). Parameters driving effectiveness of automated essay scoring with LSA. Proc. 12th Int. Conf. Artificial Intelligence in Education, July 18–22, 2005, Amsterdam, The Netherlands.

Yoon, S.-H., & Park, S. (2011). A mechanical analysis of woodpecker drumming and its application to shock-absorbing systems. Bioinspiration & Biomimetics 6(1), 016003.

Zhu, M., & Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis 51(2), 918–930.

Fig. 1. Example eigenvalues of a diagonal matrix when the penalty function is α = 0.7.

Fig. 2. Computational process of the developed clustering algorithm.

Table 1. Noun categories in WordNet

Fig. 3. Decomposition of morphology definition: shape and structure. Categories representing “structure” are selected based on noun categories related to the definition of structure in WordNet.