1. INTRODUCTION
Medical terminology in Germanic and other languages has a large stock of Latin and Greek prefixes, roots and suffixes. By and large, Greek is the language of pathology (the study of diseases) and Latin is the language of anatomy (the structure of the body). In Swedish medical language, two parallel developments can be seen with respect to this terminology. On the one hand, according to Nyman (2013a:43), the overall use of Latin and Greek terms in medical language appears to have increased since the 1950s. For example, Swedish terms that were common 50–60 years ago, such as sockersjuka, literally: ‘sugar disease’, barnförlamning ‘children's palsy’ and kräfta ‘crayfish’, are nowadays replaced by diabetes, polio and cancer. It seems that there are several reasons for this: Latin and Greek terms are precise, largely void of expressive meaning, easily adaptable into Swedish linguistic patterns, and often have direct correspondences in English and other languages. Even spin-offs of these kinds of terms into the general language are gaining ground, for example, ‘traffic infarct’ and ‘corporate anorexia’. On the other hand, the Health Record Act (patientjournallagen), adopted in 1985, brought about the first regulation on Swedification and standardization of foreign medical vocabulary, motivated by a demand for transparency and patient empowerment. Swedification (försvenskning) here means adaptation to Swedish spelling and inflection. This can be contrasted with translation, which means forming an equivalent using Swedish vocabulary. For example, Swedification of bronchitis gives bronkit, whereas translation gives luftrörskatarr (Fogelberg & Petersson 2013:12).
Although the spelling of Latin and Greek vocabulary according to Swedish conventions was regulated in 1987 (Smedby 1991, 2013:185), adherence to this in the medical community has not been uniform. As a result, the overall spelling variation has, if anything, increased, with differences depending on medical profession, medical domain, and also the kind of Latin and Greek morphemes involved. In addition to this, there is a strong influence from the English spelling of Latin and Greek, resulting in a mixture of Latin, Greek, English and Swedish spellings, sometimes in the same word. The combinatorics of these influences is enormous, giving rise to huge numbers of spelling variants of the same terms (Grigonytė et al. 2014). This in turn constitutes a problem for laypersons trying to look up terms from clinical text and a serious obstacle for automatic language processing for the purpose of simplification, normalization and text mining.
This paper is a case study in the terminological variation in the domain of health records. A health record contains systematic documentation of a single patient's medical history across time, entered by healthcare professionals with the purpose of enabling informed care. The language in this domain, which we refer to as clinical text, is produced by people who are, on the one hand, highly specialised professionals, but on the other hand are non-professional writers, giving rise to a genre which is highly interesting from a linguistic viewpoint. The goal of the study is to obtain precise quantitative measures of how the foreign terminology is manifested in Swedish clinical text. To this end, we shall study the effects of spelling influences along three dimensions: different Latin and Greek prefixes and suffixes, different medical professions, and different medical subspecialties. The rationale for confining the study to prefixes and suffixes is that the behaviour of these can be exhaustively analysed since they constitute a closed class of morphemes; at the same time, they combine with different stems in a highly productive way. As baselines for comparison, we use general medical language from a Swedish medical journal (Läkartidningen) and from a public website dedicated to medical counselling (Vårdguiden).
The purpose of the study is to answer the following research questions: How far has the process of Swedification come in the 20 years since the Health Record Act? Has the effect been to actually increase the number of spelling variants instead of standardising them? Are prefixes or suffixes more resistant to Swedification? Do linguistic factors such as position of affixes inside a word, or external factors such as domain or profession of the language user, play a role in the extent to which Swedification progresses? From a theoretical point of view, these questions are related to morphological connectivity and the combinatorial properties of affixes (Hay & Plag 2004, Baayen 2010). From a descriptive point of view, this study can be seen as an elaboration of Fogelberg & Petersson (2013), with qualitative and quantitative detail for several medical genres and types of language users. In addition, the results bear on the development of computational methods for processing of medical text for purposes such as normalization of vocabulary or information retrieval.
2. BACKGROUND
2.1 Swedish clinical text
Medical text as it is found in textbooks and journals, on the one hand, and the language of health records, on the other, are written under different conditions and for different purposes, and therefore differ substantially (Friedman, Kra & Rzhetsky 2002, Smith et al. 2014). The purpose of medical text is to transfer knowledge, which requires formal, well-structured and correct text, whereas health records are written under time pressure, being used as memory notes or information for the professional team, and seldom corrected by the author. In both of these domains, medical terminology is used to convey information as precisely and concisely as possible; in the case of health records, a key purpose of this is to ensure patient safety.
2.1.1 History and legislation of patient records
To give some context to medical documentation and its history from the perspective of how clinical notes are written, we provide a brief historical and legislative outline. Medical records have been kept in Sweden at least since the 18th century. A 1730 thesis, written in Latin, Historiis morborum rite consignandis by the Swedish physician Nils Rosén von Rosenstein, states that the purpose of keeping records by doctors is not only to be of use in the care of a patient but also to accumulate knowledge, in line with Hippocrates’ thoughts (Nilsson 2007). The author also states criteria for the content and structure of the patient record. These early records were mostly written in Latin with the Greek words of pathology. In 1863, economic logic started to influence the medical content, as rules and regulations stated that the recorded number of operations, hospitalizations and clinical visits was the basis for reimbursement. This influence of economic reimbursement is still strong in the construction of electronic health records systems. During the 20th century, the medical record also became a legal document, and as the legal rights of patients were regulated, this also came to influence the clinical texts. In Sweden, suing the doctor for malpractice is not nearly as widespread as in the USA, for example, but the imminent threat does further encourage medical professionals to be thorough and precise in their documentation of given care. For this purpose, too, the need for a precise medical terminology is evident.
Today, the documentation is regulated by the Patient Data Act (Socialdepartementet 2008). It is stated that the foremost purpose of patient record documentation is to contribute to good and safe healthcare (Patient Data Act, Chapter 3, §2). However, the legislation also regulates the language to be used in patient records, and has since 1985 included a directive on Swedish as the preferred language. The decree that the records should be written in a language that is comprehensible for the patient has never really been given preference among physicians, as they primarily see the records as a working tool for the professionals and prioritize the main purpose of safe healthcare (Allvin 2010); hence the use of technical terminology is heavy.
2.1.2 Characterization of clinical language in electronic patient records
The process of transferring patient records from paper documents into electronic records has made it possible to study and develop natural language processing (NLP) tools for information extraction and other useful methods and tools. However, since health records are sensitive texts and protected by confidentiality, the availability of large corpora for scientific studies, for example linguistic studies, is still limited.
The transfer to the electronic media has not led to improvements of the clinical texts in records as much as could have been expected. The possibilities of using the textual documentation for e.g. visualization of clinical events in timelines, tables or other graphs have been surprisingly unexplored. Also, automated documentation support such as free text search, spelling and language checking, and summarization – functions that are common in other documentation systems – has not been applied for health record systems to any greater extent as of yet. The opportunity to transfer data into more structured records, thus enabling automatic functions and statistical evaluation of health care has been explored in some health record systems, for some types of information, but much of the documentation is written as it always has been; in unstructured free-text paragraphs rich in medical terminology.
Characteristics of clinical text are surprisingly similar in different, even unrelated, languages (Friedman et al. 2002, Surján & Héja 2003, Laippala et al. 2009, Hagège et al. 2011, Bretschneider, Zillner & Hammon 2013, Temnikova et al. 2013, Smith et al. 2014). Several of these characteristics reflect the constant time pressure in healthcare, such as telegraphic text omitting words and frequent use of ad hoc abbreviations. Also, the fact that many physicians use dictaphones for documentation may sometimes contribute to an unusual sentence structure, containing many subordinate clauses – but this is less frequent. Most sentences in health records are very short (fewer than 11 words on average) and are not transcribed (Smith et al. 2014). Clinical text is heavy with technical terms, many originating from Latin, Greek or English. The nature of the diagnosis process results in many negated or speculative statements. The omission of subjects leads to information-dense sentences. Moreover, earlier studies of Swedish clinical text report frequent use of verbless sentences; for example, 63% of sentences in a corpus of radiology reports lacked a main verb (Smith et al. 2014). This is in line with findings in German and Bulgarian clinical text (Bretschneider et al. 2013, Temnikova et al. 2013).
2.1.3 Clinical subdomains and domain language
There are differences between subdomains of clinical text, e.g. different language use by different healthcare professions, in part owing to different vocabulary due to their diverse tasks but also due to varying academic training. Other variations in language use can be seen between subspecialties within the clinical professions (Patterson & Hurdle 2011, Zeng et al. 2011), not only because of different working conditions and tasks, but also on account of the varying cultures. During medical profession education and training, emphasis on teaching and learning about healthcare documentation lies more on content than on vocabulary, phrasing, and structure, and much of the style is acquired by reading existing records.
Health records documentation differs in content, style and structure depending on the situation and the purpose of the note. For instance, daily notes are written by several clinical professionals such as nurses, physiotherapists and physicians, to report on the patient's progression, for internal use by the health care team in the daily care. Other parts of the records, such as radiology reports and discharge notes, are addressed to physicians in other departments of the hospital or to the patients’ general practitioner, and are commonly more well-structured and written to summarize impressions, progression or directions/recommendations for further care planning. Linguistic and structural differences in Swedish radiology reports and daily notes have earlier been investigated as a study of genres (Kvist & Velupillai 2014, Smith et al. 2014).
2.2 Swedification of medical terminology
There are a number of international medical vocabularies also available in Swedish – ICD-10 [footnote 1], MeSH [footnote 2], SNOMED CT [footnote 3] – developed for standardization purposes in the medical domain, and for maintaining guidelines for terminology usage, including preferred spellings of clinical and medical terms. However, there is little overlap in the actual terminology use in the narrative parts of clinical texts and the terms in these terminologies (Skeppstedt, Kvist & Dalianis 2012). Similar findings have been shown for the text from a medical scientific corpus (Kokkinakis 2011a) and from public health portals (Kokkinakis 2011b).
2.2.1 Latin and Greek in medical terminology
As mentioned above, a considerable part of the medical terminology originates from Latin and Greek (Banay 1948). As with most scientific writing in the 18th century, medical patient records were originally written in Latin (Nilsson 2007), thereby being internationally comprehensible. Different languages have adapted medical terms differently (Van Hoof 1998, Bretschneider et al. 2013). Table 1 shows two examples of the way in which Latin terms have either been adapted or correspond to different terms in English, Swedish, German, French and Spanish. In Swedish, the Latin expressions for diagnoses were used for classification of disorders until 1987. Today, the Swedish medical expressions are used in the ICD-10 terminology, but the Latin expressions are included and kept as a subtitle. Table 2 summarizes Latin and Greek affixes that are common in the medical domain, obtained from Fogelberg & Petersson (2013).
Table 1. Terms for two diagnoses in Latin and according to ICD-10 in different languages, and corresponding expressions in general English and Swedish.
Table 2. Affix pairs used in this study (original:Swedified), obtained from Nyman (2013b, c, d).
2.2.2 Swedification
The Swedish medical terminology underwent a Swedification of diagnostic expressions in the 1987 update of the Swedish version of the ICD (Smedby 1991). The Swedish National Board of Health and Welfare decided to partly change the terms of traditional Latin- and Greek-rooted words. This included a Swedification of Latin and Greek affixes as well as abandoning the original rules for inflections. The purpose of this was to bring the classification language up to date and mirror the contemporary medical language.
The ambition was originally to go even further in the change of expressions and use the translated or genuine older Swedish expressions. However, it was concluded that a more radical change into Swedish terms would not gain acceptance in the medical professional community and the committee for the Swedish ICD classification settled on a degree of Swedification that would be accepted and used.
2.2.3 Spelling reform
The Swedification of diagnostic terms in 1987 was paralleled by a spelling reform in the Swedish ICD classification. However, it took a few years before the Language Committee of the Swedish Medical Association concurred with these recommendations. The spelling reform affected the Swedified versions of medical terminology expressions, while the original Latin expressions, for example involving diagnoses, anatomical structures or microbiological pathogens, kept the classical Latin spelling. The spelling reform aimed for a spelling compatible with the Swedish spelling rules. In this spelling reform, c and ch pronounced as k were changed to k, ph was changed to f, th to t, and oe to e (see Table 3). For example, the technical term for cholecystitis (inflammation of the gallbladder) is now correctly spelled kolecystit, and oesophagus is spelled esofagus.
Table 3. Transliteration rules according to the 1987 spelling reform.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:9666:20160519113133720-0529:S0332586515000293_tab3.gif?pub-status=live)
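The four transliteration rules can be illustrated with a minimal Python sketch. The function name `swedify_spelling`, the order in which the rules are applied, and the approximation that c is pronounced as k whenever it is not followed by a front vowel are all assumptions made for illustration; they are not part of the 1987 reform text, and the sketch deliberately ignores exceptions (for instance words where oe is not a Latin digraph).

```python
import re

def swedify_spelling(term: str) -> str:
    """Apply the c/ch -> k, ph -> f, th -> t and oe -> e rules to a term.

    Hypothetical helper: rule order and the front-vowel test are
    assumptions of this sketch, not taken from the reform itself.
    """
    term = term.lower()
    term = term.replace("ch", "k")  # ch pronounced as k -> k
    term = term.replace("ph", "f")  # ph -> f
    term = term.replace("th", "t")  # th -> t
    term = term.replace("oe", "e")  # oe -> e
    # c -> k only where it is (approximately) pronounced as k,
    # i.e. when it is not followed by a front vowel (assumption).
    term = re.sub(r"c(?![eiyäö])", "k", term)
    return term

print(swedify_spelling("oesophagus"))   # esofagus
print(swedify_spelling("cholecystit"))  # kolecystit
print(swedify_spelling("tachycardi"))   # takykardi
```

Note that the sketch covers spelling only; the change of inflectional endings (e.g. -itis to -it) belongs to the affix pairs in Table 2 and is not handled here.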
According to clinical terminology practice, the author can choose to write a term in either the original multi-word Latin expression, or the Swedified form, and should spell the term accordingly. Thus, the Swedification process does not apply to foreign affixes used in multi-word expressions for anatomical structures (e.g. musculus tibialis posterior, sinus cavernosus), microbiological pathogens (e.g. Staphylococcus aureus) or diagnostic terms (e.g. amaurosis fugax, status epilepticus).
However, the medical community seems to be a conservative group, and the adherence to the spelling rules in clinical practice has been gradual. Furthermore, because the medical literature is predominantly English nowadays, physicians are increasingly exposed to the English spelling of Latin and Greek words rather than the recommended Swedish one. The English medical language has, like many other languages, kept more of the original Latin spelling in medical expressions than Swedish has. This has in practice resulted in a multitude of alternate spellings of medical terms in Swedish clinical notes. For example, tachycardia (rapid heart rate) is correctly spelled takykardi in Swedish, but is also frequently found as tachycardi, tachykardi, and takycardi (Kvist et al. 2011). The phenomenon of Greek- and Latin-rooted words introducing unusual inflection forms has also been observed in German clinical texts. These words were often used interchangeably with the corresponding German word (Bretschneider et al. 2013).
3. METHODOLOGY FOR DETECTING AFFIX USE IN CLINICAL TEXTS
3.1 Methodological process
For the purpose of providing the statistics of prefix and suffix usage in Swedish clinical texts we use the following processing scheme:
1. Data extraction: token frequency lists
2. Affix string matching
a. Direct string matching + compound splitting
i. initial and non-initial prefixes of words
ii. suffixes as word endings
b. Pairwise-combinations + compound splitting
i. initial and non-initial prefixes of words
ii. suffixes as word endings
3. Expert annotation
4. Result calculation
3.1.1 Data extraction: Token frequency lists
Because of the extremely sensitive nature of the content in the Stockholm EPR corpus, the corpus was tokenized and converted to frequency lists, one for the whole corpus, and one for each subcorpus. Similarly, the comparable corpora were converted to frequency lists. These lists served as the main sources for the study described in this paper. Section 4.1 below describes the tokenization of the corpora.
3.1.2 Affix string matching
We employed substring matching for finding affixes. Prefix matching has two constraints: initial prefixes that are used at the beginning of words (e.g. arthrosknä, kryptococcos, ortopeden), and non-initial prefixes that are used as succeeding prefixes and/or prefixes in compounds (e.g. fiberrhinoskopi, hjärthypertrofi, elektrofysiologisk). Suffix matching is restricted to the endings of words only (e.g. polymyalgi, acidosis, viros).
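The three matching constraints can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, and the affix strings in the usage lines (orto, rhino, hyper, algi) are picked from the example words above purely for illustration.

```python
from typing import Optional

def match_prefix(word: str, prefix: str) -> Optional[str]:
    """Classify a prefix hit as 'initial' or 'non-initial' (None if absent)."""
    if word.startswith(prefix):
        return "initial"
    # Any later occurrence counts as a non-initial candidate; whether it
    # respects a morpheme boundary is not checked at this stage.
    if prefix in word[1:]:
        return "non-initial"
    return None

def match_suffix(word: str, suffix: str) -> bool:
    """Suffixes are only accepted at the very end of the word."""
    return word.endswith(suffix)

print(match_prefix("ortopeden", "orto"))         # initial
print(match_prefix("fiberrhinoskopi", "rhino"))  # non-initial
print(match_suffix("polymyalgi", "algi"))        # True
```

Because the non-initial test is a plain substring check, it will also fire inside unrelated words, which is the source of the false positives discussed below.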
Two processing alternatives to affix detection were employed: direct (naïve) string matching + compound splitting and pairwise affix matching + compound splitting. Table 4 illustrates which words containing the suffixes -itis and -it are detected in a word sample set by these two methods.
Table 4. Examples of suffix -itis and -it detection with two methods: string matching + compound splitting (second column) and pairwise affix matching + compound splitting (third column). Bold font indicates which words were detected by each method.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:41542:20160519113133720-0529:S0332586515000293_tab4.gif?pub-status=live)
Direct string matching results in high recall, i.e. it will guarantee that all affix instances are found, but it may result in a substantial proportion of false positives, i.e. instances erroneously recognized as containing an affix, for instance due to violated morpheme boundaries (e.g. diuretikamedicin, överarmsmusklernas).
In order to reduce the number of potential false positives, i.e. to ensure as far as possible that the identified words contain actual Latin and Greek affixes, we searched for pairwise combinations of words with original and Swedified affixes. That is, we limit the substring matching to words that occur with both the original and the Swedified affix, e.g. haematom and hematom. The pairwise-combination matching strategy narrows the observed space of affix usage by excluding individual words that contain an original or a Swedified affix only. In other words, misspelled variants and/or individual occurrences of one or the other affix type are excluded from the search space, but the method ensures that each detected word pair contains the exact same word with one Swedified and one original affix.
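A minimal sketch of the pairwise-combination idea is given below, assuming the frequency list is available as a mapping from word to count. The function name `find_pairs`, the `kind` parameter and the toy counts are hypothetical, and the sketch covers only initial prefixes and word-final suffixes, not non-initial prefixes.

```python
def find_pairs(freq, original, swedified, kind):
    """Return (original_word, swedified_word) pairs that both occur in freq.

    kind is "prefix" (word-initial only in this sketch) or "suffix".
    """
    pairs = []
    for word in freq:
        if kind == "prefix" and word.startswith(original):
            twin = swedified + word[len(original):]
        elif kind == "suffix" and word.endswith(original):
            twin = word[: len(word) - len(original)] + swedified
        else:
            continue
        # Keep the word only if its counterpart is attested as well.
        if twin in freq:
            pairs.append((word, twin))
    return sorted(pairs)

# Toy frequency list with invented counts, for illustration only.
toy_freq = {"haematom": 12, "hematom": 340, "gastritis": 5,
            "gastrit": 80, "arthros": 7, "mycket": 900}

print(find_pairs(toy_freq, "haem", "hem", "prefix"))  # [('haematom', 'hematom')]
print(find_pairs(toy_freq, "itis", "it", "suffix"))   # [('gastritis', 'gastrit')]
```

In the toy data, arthros is dropped because no counterpart with the Swedified affix is attested, which mirrors the trade-off described above: higher precision at the cost of excluding unpaired occurrences.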
Additionally, violations of morpheme boundaries can be reduced by compound splitting. For instance, compare two cases: sensori+neuralt and fiber+rinoskopi. The compound splitting that we employ in this study is based on a large general-language Swedish dictionary (The NST Dictionary 2007) and a medical domain dictionary, resulting in a precision of 83.5%, and is described in more detail in Grigonytė et al. (2014).
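To illustrate how dictionary-based splitting exposes morpheme boundaries, the following is a deliberately simplified sketch. It is not the splitter of Grigonytė et al. (2014): the function name, the `min_len` threshold, the restriction to two-part compounds and the toy lexicon are assumptions, and Swedish linking elements such as -s- are not handled.

```python
def split_compound(word, lexicon, min_len=3):
    """Split word into two parts if both parts are lexicon entries.

    Returns [head, tail] for the first valid split point, else [word].
    """
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [word]

# Toy lexicon standing in for the general and medical dictionaries.
toy_lexicon = {"fiber", "rinoskopi", "sensori", "neuralt"}

print(split_compound("fiberrinoskopi", toy_lexicon))  # ['fiber', 'rinoskopi']
print(split_compound("sensorineuralt", toy_lexicon))  # ['sensori', 'neuralt']
```

After splitting, an affix can be matched against each part separately, so that a prefix inside a compound is tested at the start of a component rather than at an arbitrary position in the full word.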
3.1.3 Expert annotation
The final methodological step employed in this study is a manual review of the resulting word pairs containing original and Swedified affixes. In this step, a senior physician manually reviewed the resulting word pairs to identify false positive affix matches such as Congo (country), mycket (Swe: much), Karina (name), kortet (Swe: the card) – which are regular Swedish words and names, not Latin or Greek – and to identify other potential errors, as well as qualitatively categorize and analyse the results. Due to the time costs involved in manual analysis, this step was only employed on the pairwise-combinations (Sections 5.2–5.6), not on the results obtained after employing the direct matching technique (Section 5.1).
This three-step semi-automatic procedure aims to ensure that the words identified as containing affixes are of as high quality as possible. Alternatively, an entirely automatic procedure could be viable, for instance by replacing the manual inspection with an unsupervised morphological segmenter. The Morpho Challenge 2010 (Kurimo et al. 2010) reported the following highest performance for unsupervised segmenters: F = 64.55% for general English and F = 47.64% for general German. To our knowledge, these segmenters have not been tested on domain data, and we can therefore only hypothesize that their performance on Swedish clinical data would be no better.
3.1.4 Result calculation
Results on affix usage in Swedish clinical texts are reported as absolute and normalized proportions. By absolute we mean that the statistics are based on the absolute numbers of occurrences. When interpreting these values, especially with the pairwise-combination method, it is necessary to be aware that infrequent word pairs can be overshadowed by one or several very frequent ones. To counteract this effect, we also use values normalized by type, so that each word pair in a given affix group has an equal weight. One- and two-sample z-tests of proportions (p < .0001, two-tailed) are used for statistical significance testing.
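The difference between the two kinds of values, and the significance test, can be sketched as follows. This reflects one plausible reading of the description: the absolute proportion pools all occurrences, whereas the type-normalized proportion averages the per-pair proportions. The function names, the pooled-variance form of the two-sample z-test and the toy counts are assumptions of this sketch.

```python
import math

def absolute_proportion(pairs):
    """Share of original-form tokens over all tokens in the affix group."""
    original = sum(o for o, s in pairs)
    total = sum(o + s for o, s in pairs)
    return original / total

def normalized_proportion(pairs):
    """Mean of per-pair proportions: every word pair weighs the same."""
    return sum(o / (o + s) for o, s in pairs) / len(pairs)

def two_sample_z(x1, n1, x2, n2):
    """Two-sample z-test of proportions (pooled), two-tailed p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Toy (original_count, swedified_count) pairs with invented numbers:
# one very frequent pair and one rare pair.
toy_pairs = [(1, 999), (5, 5)]
print(absolute_proportion(toy_pairs))    # dominated by the frequent pair
print(normalized_proportion(toy_pairs))  # both pairs count equally
```

With the toy data, the absolute proportion is below 1% while the type-normalized proportion is about 25%, which shows how a single high-frequency pair can mask the behaviour of rarer ones.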
3.2 Experiments
Nyman (2013b, c, d) lists Latin and Greek affixes that are commonly used in the Swedish medical domain. We select the subset of those affixes for which the Swedification rules apply (Table 2 above) and analyse their usage in Swedish clinical text. We conduct experiments for six distinct affix usage patterns.
Two experiments compare clinical affix usage in the notes of Swedish Electronic Health Records (EHR) with two other medical genres (medical publications and medical online forum articles):
• (1) the proportion of original and Swedified affixes and how it compares to the two other medical genres, and
• (2) the difference, if any, in the use of Latin and Greek affixes in Swedish clinical text compared with the other two genres.
Four experiments are conducted to characterize the usage of affixes in clinical EHR text only. Two concern affix usage depending on (3) the position in the word and (4) the length of the affix; the other two concern differences in affix usage between (5) clinical professions and (6) clinical subspecialties.
4. DATA
4.1 Clinical corpus
The corpus used in this study is the Stockholm Electronic Patient Record (EPR) Corpus with data from the years 2006–2010 [footnote 4] (Dalianis, Hassel & Velupillai 2009, Dalianis et al. 2012). The corpus contains de-identified patient notes documented in the Electronic Health Records (EHR) system used in Stockholm City Council (TakeCare [footnote 5]) with the exception of some categories of records, for example from psychiatry and venereology. In this system, clinical notes are written in semi-structured templates, where each clinical department and profession can define specific templates for their purposes, e.g. a template containing headings such as Past medical history, Current status, Assessment. A template can consist of free-text fields (notes) as well as structured entries such as boxes and dropdown menus with predefined values. The notes are written in Swedish, and by different clinical professionals, e.g. physicians, nurses, dieticians. There is no information about author identity (e.g. names) or other individual distinguishing aspects such as age in the corpus, only information about profession type. The TakeCare EHR system did not supply any support for grammar- or spell-checking during the years 2006–2010.
For this study, only the narrative text was used, leaving out structured parts such as laboratory results and code lists, e.g. diagnosis codes and procedures. All written notes were extracted from the entire document collection [footnote 6] and tokenized using an adapted version of Stagger (Östling 2013) [footnote 7], and word frequency lists were created. Only tokens containing alphabetic characters were used, all converted to lowercase. Although it consists only of frequency lists from this point on, we continue to call this corpus the Stockholm EPR corpus; see Table 5 for details.
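The conversion from tokens to a frequency list can be sketched as below. The sketch assumes that tokenization has already been done (Stagger itself is not reproduced here) and reads "containing alphabetic characters" as "containing at least one alphabetic character"; both the function name and that reading are assumptions.

```python
from collections import Counter

def frequency_list(tokens):
    """Count lowercased tokens that contain at least one alphabetic character."""
    return Counter(
        token.lower()
        for token in tokens
        if any(ch.isalpha() for ch in token)
    )

# Invented example tokens, for illustration only.
toy_tokens = ["Pat", "pat", "120/80", "CRP", "b12", "."]
print(frequency_list(toy_tokens))
```

Since only the counts are retained, the original notes cannot be reconstructed from the resulting list, which fits the confidentiality constraints described for this corpus.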
Table 5. Features of the clinical corpus and its subcorpora (profession ‘nurses’ includes nurses and midwives, profession ‘assistant nurses’ means nurses without academic training, profession ‘physiotherapy practitioners’ includes physiotherapists, chiropractors, and naprapaths).
4.1.1 Subcorpora
The Stockholm EPR corpus was further divided into two main subcorpora, each with five categories: (i) clinical profession, and (ii) clinical subspecialty (Table 5). Structured data linked to the free text revealed codes for author profession and clinical unit and were used to compile the subcorpora.
The main authors of patient records are physicians and nurses, as can be seen by the corpora sizes in Table 5, but many other clinical professions write progress notes or daily notes of care. Physiotherapists, chiropractors and naprapaths have a common denominator in their focus on anatomical structures and the physiology of the body, and their notes were combined in order to get a sizable corpus. Dieticians have a highly specialized focus of the patients’ dietary needs and related pathologies. To study the influence of academic training, corpora were created for notes written both by nurses with an academic education and by assistant nurses (undersköterskor) without academic training.
The Stockholm EPR Corpus contains free text written at more than 500 medical units, some of which are within the same hospital department. In order to study differences in language use between clinical subspecialties, several medical units were combined to form subdomains designed to reflect diverse subspecialties within Karolinska University Hospital, using records for both inpatients and outpatients. A corpus of operating specialties was compiled by pooling the records from several departments of surgery (general surgery as well as plastic, neuro, and thoracic surgery) and orthopaedic surgery. For the other corpora of subspecialties, records were pooled from several wards and outpatient clinics of the respective departments. Most of the authors are physicians and nurses.
4.2 Comparable corpora
For comparison, we also use data from Läkartidningen and Vårdguiden. Läkartidningen (the journal of the Swedish Medical Association) is a weekly medical scientific and trade-union journal published in Swedish for medical professionals. It contains biomedical scientific publications as well as articles about new medical scientific findings, studies in the pharmaceutical domain, health economic discussions and evaluations, as well as opinion pieces and political discussions. All articles published in Läkartidningen are copyrighted, but an openly available corpus containing randomly assembled sentences taken out of context is accessible through Språkbanken [footnote 8] (Kokkinakis 2012). The Läkartidningen corpus was retrieved in January 2014 from Språkbanken's Korp [footnote 9].
The Vårdguiden corpus contains articles from 1177.se and Vårdguiden.se [footnote 10], which are national Swedish online search engines and medical knowledge repositories dedicated to health related information, services, queries and discussions for the public, provided by all Swedish health care counties and regions. All entries about diseases, facts and recommendations were downloaded from these websites [footnote 11] and tokenized using the adapted version of Stagger. A summary of these corpora is presented in Table 6.
Table 6. Features of medical corpora used for comparison to the clinical corpora.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:37024:20160519113133720-0529:S0332586515000293_tab6.gif?pub-status=live)
5. RESULTS AND ANALYSIS OF FINDINGS
We present results and analysis from the experiments on the six different affix usage patterns in Swedish clinical text. For each pattern, we summarize the affix matching methodology and the corpora used for the specific pattern analysis.
5.1 Pattern 1: Latin and Greek affixes in three medical genres
Method: Direct string matching + compound splitting
Data: The Stockholm EPR Corpus, the Läkartidningen corpus, the Vårdguiden corpus
Figures 1 and 2 summarize Latin and Greek suffix and prefix usage in three different corpora. All of the affix pairs occur in the clinical corpus. However, some of the affixes do not occur in the comparable corpora, e.g. circum- and cirkum- or cac- and cak- in the Vårdguiden corpus. Notably, several (n = 20) affix pairs are not found in the Vårdguiden corpus at all. This latter corpus is written for the general public, not for medical professionals with training in medical terminology. Consequently, Swedified forms are more common than original forms, and Latin endings appear only for original Latin multi-word expressions. The proportions of affixes in both original and Swedified forms in terms of absolute values are summarized in Figure 1 and Figure 2.
Figure 1. Latin and Greek prefix usage in three genres: clinical text (the Stockholm EPR corpus), scientific articles (the Läkartidningen corpus), and medical online information articles (the Vårdguiden corpus).
Figure 2. Latin and Greek suffix usage in three genres: clinical text (the Stockholm EPR corpus), scientific articles (the Läkartidningen corpus), and medical online information articles (the Vårdguiden corpus).
The overall usage of Latin and Greek affixes in original form is low in all three corpora. The majority of affixes are used in Swedified forms. The proportions of original-form (Latin and Greek) prefix matches in the three corpora are 3.4%, 2.1%, and 1.1%, respectively. The online information genre (the Vårdguiden corpus) has the lowest proportion of prefixes in original Latin or Greek form. The proportions of original-form (Latin and Greek) suffix matches in the three corpora are even lower: 0.8%, 0.5%, and 0.3%, respectively.
The observable effect of high proportions of some suffixes and prefixes in original forms in the Vårdguiden and the Läkartidningen corpora is due to a very low number of occurrences, e.g. (Greek affixes left, Swedified right):
(1)
Another source of complication for interpreting the results of pairwise affixes is that some regular Swedish inflections are similar to foreign suffixes, e.g. the genitive form in multi-word expressions such as Kaposis sarcom can be mistaken for -osis (in the pair -osis and -os), as Swedish does not use apostrophes for genitive. Example (2) below includes unwanted pairing shown with the number of occurrences (Greek suffix left, Swedified right). This example illustrates an interesting aspect of the guidelines for the Swedification process – the fact that multi-word Latin expressions should be written in their original form – instances which will not be captured through our chosen methodology.
(2)
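The genitive problem described above can be illustrated with a minimal sketch of suffix matching by plain string comparison. The function name `split_suffix` and the token list are hypothetical and chosen only for illustration; this is not the matching code used in the study.

```python
def split_suffix(word, suffixes):
    """Return (stem, suffix) for the longest listed suffix ending the word, else None."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return None


# One affix pair: original Greek form and Swedified form.
SUFFIX_PAIR = ("osis", "os")

# Illustrative tokens only; "Kaposis" is a Swedish genitive, not an -osis term.
tokens = ["sarcoidosis", "sarkoidos", "fibros", "Kaposis"]
matches = {t: split_suffix(t.lower(), SUFFIX_PAIR) for t in tokens}

for token, match in matches.items():
    print(token, match)
```

Under these assumptions the genitive Kaposis is split into a stem and -osis exactly like a genuine term, which is why such hits need to be filtered or reviewed separately.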
5.2 Pattern 2: Differences of the usage between Latin and Greek affixes
Method: Pairwise-combinations + compound splitting
Data: The Stockholm EPR Corpus, the Läkartidningen corpus, the Vårdguiden corpus
Results for the difference of usage between Latin and Greek affixes in the three corpora are presented in Table 7. First, the Swedified form, irrespective of the type of affix, is strongly preferred to the original form in the Swedish medical domain. These differences are statistically significant at the .001 level according to one-sample z-tests of proportions (p < .0001, two-tailed) across all three corpora. Secondly, prefixes are more Swedified than suffixes. This result holds for both Latin and Greek in both the EPR and Läkartidningen corpora; all the differences are statistically significant at the .001 level according to two-sample z-tests of proportions (p < .0001, two-tailed). In the smaller Vårdguiden corpus, where many affix pairs are not present, there are no significant differences at the .05 level. Thirdly, in the EPR corpus Latin prefixes are more Swedified than Greek prefixes, whereas Greek suffixes are more Swedified than Latin suffixes. The differences are again statistically significant at the .001 level according to two-sample z-tests of proportions (p < .0001, two-tailed). On the other hand, there are no significant differences of this kind at the .05 level in the Läkartidningen and Vårdguiden corpora. Fourthly, both the prefixes and suffixes of the EPR corpus are more Swedified than those in the Läkartidningen corpus, which are in turn more Swedified than those in the Vårdguiden corpus (note again, however, that many affix pairs are not present in the Vårdguiden corpus). The only exception to this concerns Greek suffixes, where the differences are not significant at the .05 level; the other differences are statistically significant at the .001 level according to two-sample z-tests of proportions (p < .0001, two-tailed).
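For readers unfamiliar with the tests named above, the following is a minimal sketch of one-sample and two-sample z-tests of proportions with two-tailed p-values. The counts are hypothetical and are not taken from Table 7; the null value of 0.5 in the one-sample test and the use of a pooled standard error in the two-sample test are assumptions, since the text does not state them.

```python
import math


def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def one_sample_z(x, n, p0=0.5):
    """Test whether the observed proportion x/n differs from p0 (assumed 0.5 here)."""
    z = (x / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return z, two_tailed_p(z)


def two_sample_z(x1, n1, x2, n2):
    """Test whether two proportions differ, using the pooled standard error."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return z, two_tailed_p(z)


# Hypothetical counts of Swedified forms out of all matched affix occurrences.
z1, p1 = one_sample_z(970, 1000)
z2, p2 = two_sample_z(970, 1000, 920, 1000)
print(f"one-sample: z = {z1:.2f}, p = {p1:.2g}")
print(f"two-sample: z = {z2:.2f}, p = {p2:.2g}")
```

With corpus-scale counts even small differences in proportion yield very small p-values, so effect sizes should be read alongside the significance levels.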
Table 7. The proportions of original and Swedified forms of Latin and Greek affixes in the Stockholm EPR, Läkartidningen and Vårdguiden corpora. The statistics are calculated from absolute numbers of occurrences.
The Swedified version of the adjectival Latin suffixes can be inflected and thus would not be captured by the pairwise-combinations method. For instance, infraorbitalis ‘infraorbital’ (meaning ‘located below the eye socket’) can in Swedish be inflected as infraorbital as well as infraorbitalt, depending on the head word (agreement). Two examples of pairwise combination of adjectives are shown with the number of occurrences (Latin suffix left, Swedified right):
(3)
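The limitation described above can be made concrete with a minimal sketch of pairwise combination for a single suffix pair. The function `pairwise_combinations` and the frequency counts are hypothetical; the sketch assumes that a pair is formed only when the same stem occurs with both the original and the Swedified suffix.

```python
from collections import Counter


def pairwise_combinations(counts, original, swedified):
    """Pair each word ending in the original suffix with its Swedified twin, if attested."""
    pairs = []
    for word, n_orig in counts.items():
        if word.endswith(original) and len(word) > len(original):
            candidate = word[: -len(original)] + swedified
            if candidate in counts:
                pairs.append((word, n_orig, candidate, counts[candidate]))
    return sorted(pairs)


# Hypothetical token frequencies, for illustration only.
counts = Counter({
    "infraorbitalis": 3,
    "infraorbital": 12,
    "infraorbitalt": 5,   # inflected Swedified form
    "lateralis": 7,       # no Swedified twin in this toy vocabulary
})

pairs = pairwise_combinations(counts, "alis", "al")
print(pairs)
```

In this toy vocabulary the inflected form infraorbitalt is never paired, so its occurrences would be missing from the Swedified total unless inflection is handled in a separate step.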
In some cases word misspellings can have an influence. Consider two examples of misspellings (acustisk and akusticus) found in pairwise combinations originating from the suffix -icus, where the correct Swedish spelling would be akustisk (Greek left, Swedified right):
(4)
Another source of errors comes from abbreviations, for instance mobil and mobilis, both abbreviations for mobiliserad, yielding pairwise combinations (Latin suffix left, Swedified right):
(5)
The latter type of affix matching error is observed more often with words that appear to contain a Swedified suffix.
5.3 Pattern 3: Affix usage depending on the position in a word
Method: Pairwise-combinations + compound splitting
Data: The Stockholm EPR Corpus
In this section we analyse the impact of the position of an affix in a word. We compare prefixes that occur as the first syllable of a word, prefixes that occur as the second or later syllable of a word and suffixes – the last syllable of a word, see Table 8 for examples.
Table 8. Examples of prefixes and suffixes in different positions of words.
Figures 3a and 4a illustrate which original and Swedified prefixes and suffixes are found in the clinical corpus. These proportions are based on the normalized values by type. Part (a) of Figure 3 displays the percentage for each prefix found in its original or Swedified form. Brachy-, for instance, is mainly found in its original form, whereas the Swedified makr- is preferred to the Greek macr-. Part (b) of the figure displays proportions for the same original-Swedified prefix pairs when prefixes are found in the non-initial position of a word. The most prominent changes are observed with, for instance, galact–galakt or aesthes–estes. Part (c) of the figure shows the difference of those changes (increase on the positive axis, decrease on the negative) in percentage for each prefix pair.
Figure 3. Latin and Greek prefixes found in initial and non-initial positions of words: (a) prefixes found as the initial syllable of a word; (b) prefixes found as the non-initial syllable of a word; (c) the difference between the two. Negative bar means decrease, positive bar – increase.
Figure 4. Latin and Greek suffixes found as the proportion of original and Swedified suffixes based on normalized (a) and absolute (b) values.
Figure 4 presents proportions of the found suffixes in normalized values by type and absolute values. The proportion of the Swedified suffix is dominant for most of the suffixes as expressed in absolute values. The suffix graph of the normalized values shows that many not so frequent word types in fact contain original suffixes.
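The contrast between absolute and normalized values can be sketched as follows. The interpretation of "normalized by type" used here, namely that each paired word type contributes equally regardless of its frequency, is an assumption, and the counts are hypothetical.

```python
def original_share(pairs):
    """Share of original forms, computed over tokens and averaged over word types.

    pairs: list of (original_count, swedified_count), one entry per paired word type.
    """
    total_original = sum(o for o, s in pairs)
    total_tokens = sum(o + s for o, s in pairs)
    absolute = total_original / total_tokens
    # Assumed meaning of "normalized by type": every type weighs the same.
    by_type = sum(o / (o + s) for o, s in pairs) / len(pairs)
    return absolute, by_type


# Hypothetical data: one very frequent, mostly Swedified type
# and one rare type that is mostly written in its original form.
pairs = [(1, 999), (8, 2)]
absolute, by_type = original_share(pairs)
print(f"absolute: {absolute:.4f}, by type: {by_type:.4f}")
```

Under this reading, a handful of rare word types with original spelling can raise the normalized share considerably while leaving the absolute share almost unchanged.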
When analysing the pattern of the affix position in a word, we look at initial and non-initial prefixes and suffixes as somewhat ‘equal’ in an abstract way. By doing this we aim at quantitatively identifying whether the position in the word determines how likely the affix is going to be used in its original or Swedified form. Table 9 summarizes our findings in terms of paired words containing original and Swedified affixes.
Table 9. Proportions of original (Orig) and Swedified (Swe) affix positions in a word.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:86362:20160519113133720-0529:S0332586515000293_tab9.gif?pub-status=live)
The findings show that the position in the word does matter for the chance of the affix being used in original or Swedified form. The normalized data reveal that the initial prefixes are found in original form in 34%, non-initial prefixes in 30%, and suffixes in 23% of cases.
5.4 Pattern 4: Latin and Greek affix use depending on the length of the affix
Method: Pairwise-combinations + compound splitting
Data: The Stockholm EPR Corpus
This part of the analysis is motivated by the fact that all affixes become either shorter or keep the same number of characters after the Swedification rules apply. Our initial hypothesis is that using the shorter affixes would result in shorter words, which might be important for saving time when clinical notes are composed. Several studies have described a high prevalence of abbreviations in clinical texts, which supports the notion that shorter is better in the clinical domain (Xu, Stetson & Friedman 2007, Kvist & Velupillai 2014).
We have split suffixes and prefixes into three groups according to the length (number of characters) of the Swedified affix: 2–3, 4 and 5–7 characters. Table 10 summarizes the usage patterns depending on the length of the affix.
Table 10. Proportions of original (Orig) and Swedified (Swe) affixes found depending on the length of the affix.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:43980:20160519113133720-0529:S0332586515000293_tab10.gif?pub-status=live)
Our findings did not show any correlation with the length of the affix. Both the normalized and the absolute values for suffixes and prefixes show no linear dependence on affix length. For suffixes, however, we can observe a tendency: when suffixes are very long (5–7 characters), the proportion that become Swedified is larger, suggesting that the shorter ending is preferred.
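The three length groups used in this section can be expressed as a small helper. The function name `length_group` is hypothetical, and stripping the hyphen before counting characters is an assumption about how affix length was measured.

```python
def length_group(swedified_affix):
    """Assign a Swedified affix to one of the three length groups, or None if outside them."""
    n = len(swedified_affix.strip("-"))  # assumed: the hyphen is not counted
    if 2 <= n <= 3:
        return "2-3"
    if n == 4:
        return "4"
    if 5 <= n <= 7:
        return "5-7"
    return None


# Affixes mentioned elsewhere in this article, used here only as examples.
for affix in ["-os", "kor-", "makr-", "kardi-"]:
    print(affix, length_group(affix))
```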
5.5 Pattern 5: Latin and Greek affix by clinical profession
Method: Pairwise-combinations + compound splitting
Data: The Stockholm EPR Corpus
This section analyses how Latin and Greek affixes are used among five clinical professions: physicians, nurses, assistant nurses, physiotherapy practitioners, and dieticians. Table 11 summarizes the proportions of original and Swedified affixes for the five professions.
Table 11. Proportions of original (Orig) and Swedified (Swe) affix found in subcorpora of clinical professions from the Stockholm EPR corpus.
The most prominent pattern is that assistant nurses and dieticians proportionally use more original form prefixes than other professions. In terms of normalized values, the most conservative groups of professions are nurses and assistant nurses: 40% and 40% of prefixes and 34% and 40% of suffixes are used in the original Latin and Greek form. Especially for suffixes, this is a strong contrast to physicians, for whom 23% of suffixes are used in the original form.
We interpret this as an effect of two factors: the size of the subcorpus and the language differences among the professions. The language of physicians is packed with domain terminology and abbreviations that are ambiguous; for instance, ‘c’ can mean cancer, cell, corpus, circa, and the adjective central. The absence of pronouns and verbs is yet another very typical feature (Temnikova et al. 2013, Smith et al. 2014). The assistant nurses do not have the same academic training as the physicians, which suggests a smaller domain vocabulary and the need to express the same concepts in general and thus more verbose language.
5.6 Pattern 6: Latin and Greek affix by clinical subspecialty
Method: Pairwise-combinations + compound splitting
Data: The Stockholm EPR Corpus
In this section, we present an analysis of how Latin and Greek affixes are used among five clinical subdomains: operating specialty, oncology, infection, cardiology, and neurology. Table 12 summarizes the proportions of original and Swedified affixes for each of the five clinical subspecialties.
Table 12. Proportions of original (Orig) and Swedified (Swe) affix found, depending on clinical subspecialty from the Stockholm EPR corpus.
We did not find any strong correlation from the statistics presented in Table 13 related to the professions. In terms of absolute affixes found, the most conservative spelling is within the specialties of oncology and infection. In terms of normalized values, prefixes and suffixes are found in rather similar proportions for all specialties.
Table 13. Differences between medical specialties: proportions of original and Swedified prefixes found in the Stockholm EPR corpus.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:67458:20160519113133720-0529:S0332586515000293_tab13.gif?pub-status=live)
The detailed manual analysis of the results revealed that the vocabulary is strikingly different between each specialty. For instance, in the cardiology subcorpus, there seems to be more progressive use of c/k as in cardi-/kardi- (see Table 13).
We also found a strong lexical preference for some prefixes. In the case of cor-/kor-, the words are very often related to coronary topics (i.e. the vessels in the heart giving angina pectoris or heart attack), but in the surgical specialties they are related to cortison treatments and cortex or other anatomical structures. In the cardiology subcorpus, we find the Swedified form of the prefix kor- as in koronar- as the first element of a compound, in contrast to the surgical subcorpus, where the original form of the prefix, cor-, is preferred over the Swedified kor- in words related to the coronary topic. We interpret this as a possible pattern that applies to the specialty-specific vocabulary, i.e. terms that are more frequently used within a specialty tend to be more Swedified, whereas the spelling would be more conservative for less frequently used terms in the vocabulary/specialty.
6. DISCUSSION
A large proportion of the medical terminology originates from Latin and Greek, in Germanic as well as other languages. In Sweden, since the 1980s, there has been a process of Swedification in the medical domain, which has included a spelling reform and modified affix use. This reform has taken time to have an effect in the medical society.
6.1 Linguistic characterization of Swedish clinical text for knowledge extraction
The present study contributes to the linguistic characterization of Swedish clinical language. Such characterization is essential for constructing automated language analysis tools that can be used for knowledge extraction from clinical text. It has previously been found that many words and expressions in Swedish clinical free text cannot be automatically identified by vocabulary matching to established terminologies (Skeppstedt et al. 2012, Grigonytė et al. 2014). This is in part due to medical jargon and the extensive use of ad hoc abbreviations (Kvist & Velupillai 2014), but also misspellings and foreign words. Also, many words are hybrid words with a spelling being neither Swedish nor Latin or Greek, as a result of the ongoing Swedification and adaptation to new spelling rules. For instance, bronchit (contemporary Swedish: bronkit) is a common hybrid word found in the clinical corpus, originating from bronchitis, losing its suffix but partly keeping an original spelling (ch instead of k). The findings from this study could be used for development of NLP preprocessing tools that need to be adapted to this domain such as syntactic parsers and part-of-speech taggers (Skeppstedt 2013). For instance, the word pairs of Swedified and original affixes along with the information about proportions resulting from this study can be useful for developing term normalization methods that map term variants to uniform concepts.
With sophisticated preprocessing tools, resources such as the Stockholm EPR corpus can be used to build useful applications and systems with the goal to improve health care, such as clinical decision support systems, automatic diagnosis coding (Henriksson, Hassel & Kvist 2011), text simplification for patient empowerment (Grigonytė et al. 2014) and surveillance of adverse events (Tanushi, Kvist & Sparrelid 2014).
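As an illustration of what such a term normalization step might look like, the sketch below maps spelling variants to a single Swedified key using two rewrite rules. The rules, the function names `normalize` and `group_variants`, and the choice of the Swedified form as the canonical key are assumptions made for this example; a practical system would need a much larger, validated rule set restricted to known medical stems, since a blanket ch-to-k rule would also alter ordinary Swedish words.

```python
import re
from collections import defaultdict

# Two illustrative rewrite rules (assumed, not taken from the study):
# original "ch" spelling to Swedified "k", and the suffix -itis to -it.
RULES = [
    (re.compile(r"ch"), "k"),
    (re.compile(r"itis$"), "it"),
]


def normalize(term):
    """Map a spelling variant to a hypothetical canonical Swedified key."""
    term = term.lower()
    for pattern, replacement in RULES:
        term = pattern.sub(replacement, term)
    return term


def group_variants(terms):
    """Group surface variants under their shared normalized key."""
    groups = defaultdict(set)
    for term in terms:
        groups[normalize(term)].add(term)
    return dict(groups)


variants = ["bronchitis", "bronchit", "bronkit", "cholecystitis", "kolecystit"]
groups = group_variants(variants)
for key, members in sorted(groups.items()):
    print(key, sorted(members))
```

Grouping by a normalized key lets the original, hybrid and Swedified spellings of one term be counted together, which is the aggregation the affix-pair proportions could help to guide.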
6.2 Findings
Both prefixes and suffixes are used in their Swedified form in clinical Swedish text to a very large extent. This pattern remains strong independently of which clinical subcorpus we studied. If contrasted, prefix usage is more conservative than suffix.
As expected, the proportion of Swedified prefixes and suffixes is relatively smaller in clinical texts than in scientific articles and even smaller when compared with medical online information pages. This holds for absolute values and normalized by-type values.
One important factor that would definitely give more insight, but is not covered in this study, is related to misspellings and ad hoc abbreviations, which are abundant in clinical texts, since this type of text is written under time pressure and most often for the purpose of internal healthcare communication. Patient records are seldom corrected after being written. Scientific articles and online information pages, on the other hand, are reviewed in the process of writing or can even be updated after they have been published, and are written for a broad audience. As an example of (mis)spelling variation in the clinical domain, which also demonstrates an obvious need for aggregation of such cases, consider the following pairs with the Greek suffix (left) and the Swedified form (right) (correct spelling in the first pair):
(6)
To study such examples further with respect to Swedification patterns in clinical, scientific and online health information would require a methodology different from that employed in this study. For instance, terms would need to be normalized and mapped as belonging to the same concept, which would require knowledge about which different variants should be mapped to which concept, both within and across corpus types. Moreover, a deeper study of how these different text types compare in the use of Swedification changes in a larger discourse (that is, not explaining only word pairs) would also require taking context into account.
The difference in the findings for Greek and Latin affixes has shown that Greek prefixes in the original form are more common than Latin in terms of normalized values. The suffix pattern is very similar. It should be noted that the set of Greek affixes used in this study was larger than that of Latin.
The affix analysis depending on the position in a word revealed a positive correlation: initial prefixes are found in larger proportion in their original form if compared with non-initial prefixes and suffixes.
A somewhat surprising finding is that neither the normalized nor the absolute values for suffixes and prefixes show any linear dependence on affix length, apart from very long suffixes (5–7 characters), for which the proportion of Swedified usage increases.
The analysis of affixes in various sets of subcorpora has shown insignificant differences in affixes found in the different subdomains of clinical text on the basis of surface parameters. After a closer examination, we conclude that the vocabulary in the different subcorpora clearly reflects the divergent working tasks of different professionals and different subspecialties. Lexical features of a subdomain language can be used for unsupervised clustering of text (Patterson & Hurdle 2011, Zeng et al. 2011), but these studies do not specifically focus on the usage of terminology with foreign origin. Patterson & Hurdle (2011) suggest that differences in language use between professionals, which create disjoint sublanguages, influence the creation of NLP tools for clinical text. A tool which relies on term statistics or semantics and is trained on one clinical note type may not work as well on another.
The analysis of the affix use by different healthcare professionals was limited by the methodology of only extracting word pairs. Thus, the higher frequency of original affixes for the assistant nurses (without academic training) may not necessarily reflect a trend of using original Latin/Greek affixes, as we found very few affix pairs for this group of professionals. A possible explanation is that assistant nurses are unused to writing these words, and are therefore unsure of the spelling.
When analysing subcorpora for different medical subspecialties, there are apparent differences in the use of specific expressions within an affix group, as was shown for cardi-/kardi- in Section 5.6. There are, on this level, striking differences both in the number of instances found for different expressions and in the affixes found for certain expressions. The finding that the cardiology subspecialty uses Swedified prefixes for expressions specific to its line of profession is in contrast to the findings for the surgical specialty, where the original affix chol- is more likely to be used for a large number of gastrointestinal terms, e.g. cholecystitis (gallbladder inflammation) instead of the Swedified kolecystit.
6.3 Limitations
Although this study is based on the largest existing data set of Swedish clinical text available for research, there are some limitations. The pairwise-combination matching strategy narrows the observed space of the affix usage by excluding individual words containing only an original or Swedified affix without a matching word with the other affix. However, with this strategy, we are able to precisely study their usage in combination, and given the very large size of the Stockholm EPR corpus (1.6 billion tokens), we believe that these found combinations reveal a sufficient approximation of Swedification patterns. We intend to further study the number of word types that were missed because of this strategy – word types with either exclusively Latin or Greek spelling, or exclusively Swedified spelling – and analyse whether or not we find additional patterns through this.
In the frame of this study it was not possible to perform a manual review of the results from direct matching (pattern 1 described in Section 5.1 above). It also has to be noted that the state-of-the-art processing tools (like morphological segmenters and part-of-speech taggers) were not applicable in this study because their performance is currently not meeting the required level for this domain. That partially stems from the low lexical coverage as there are no dictionaries that could deal with at least 50% of the vocabulary used in the Stockholm EPR corpus (i.e. almost four million types, whereas the largest Swedish dictionary resource contains 900,000 types, and the largest Swedish medical domain dictionary contains 500,000 terms). In future studies, we intend to extend the manual review analysis to a larger set in order to be able to quantify how well the employed string matching techniques work for this task.
Furthermore, it was not possible to add a time axis as an additional variable in this study, nor information about author age or other characteristics that would have been informative for understanding changes over time or whether or not there are differences in word usage depending on author age. Since the corpus only covers the years 2006–2010, we suspect that changes over time would not be evident for such a relatively short time period, but will investigate if this information could be extracted and further studied in our future work.
7. CONCLUSIONS
This case study has explored the use of Latin and Greek affixes in medical texts of three types: patient records, scientific medical text and online medical information for laymen. Special attention has been given to different domain languages/subdomains of patient records according to profession and medical specialty. The research has been performed on a very large corpus of Swedish clinical text, the Stockholm EPR Corpus, and compared with medical language from Läkartidningen and Vårdguiden. By studying pair frequencies of Latin or Greek affixes in original and Swedified form in these corpora, we have been able to obtain precise measures of the usage of these affixes in the Swedish medical domain, and characterize this in more detail. We have conducted experiments using several distinct patterns with the aim of explaining the numerous variations in the usage of Latin and Greek affixes that are manifested in Swedish medical text.
The results of this study show that to a large extent affixes in clinical text are Swedified. The Swedification of clinical text is, however, less common when compared with other medical domain genres, such as scientific publications and online medical texts for laymen.
We have observed that prefixes are more likely to be preserved than suffixes. This is also correlated with the quantitative study of the affixes related to their position in the word. This general pattern seems to be consistent with the Swedish word formation practice, where the productivity of suffixes is greater than that of prefixes in the sense that suffixes are more common in absolute terms than prefixes; perhaps this is an indication that suffixes are more likely to be Swedified on the grounds that they are more common.
To our knowledge, this is the first study on a systematic characterization and analysis of the behaviour of Latin and Greek affixes in Swedish medical text.
ACKNOWLEDGEMENTS
The authors wish to express their gratitude to Björn Smedby for kindly reviewing and confirming the history of the Swedification process, to Martin Duneld for excellent technical assistance generating Vårdguiden data, and to anonymous reviewers for their comments. We are grateful to Hercules Dalianis for the initiative of Stockholm EPR Corpus. This work was partially funded by the Vårdal Foundation and Swedish Research Council (350-2012-6658), and supported by Swedish Fulbright Commission and the Swedish Foundation for Strategic Research through the project High-Performance Data Mining for Drug Effect Detection (ref. no. IIS11–0053).