Introduction
The digital processing of data opens new, ground-breaking possibilities in research. Data analysis techniques, generally classified as text and data mining (TDM) techniques, make it possible to draw new information from existing research and data. TDM techniques are also used to develop generative artificial intelligence (AI) systems, including large language models (LLMs). Such models are likely to become standard working tools for efficient text production in the future, and research institutions and libraries are working to develop models in national languages as part of their public interest mission to preserve and manage national cultural heritage and languages.Footnote 1
For researchers, TDM methods are powerful tools for understanding and making use of existing data and information. The combination of unlimited “reading” capability and the ability to analyze huge amounts of data using statistical and mathematical methods yields new research results in the forms of new learning and understanding as well as the ability to make more accurate predictions using automated tools. With generative AI, research activities are turning towards the development of AI models, including language models. Research institution libraries are in a unique position to train and develop national language LLMs due to their access to large collections and repositories of literature and textual materials. The public interest mission of publicly financed research institutions and libraries, however, demands that research activities adhere to principles of research ethics and that results are made available to benefit society.
The public interest in such research activities and the need for research institutions and cultural heritage institutions to engage in TDM activities without legal uncertainty have been recognized by the European Commission. In response, an exception from copyright, database rights, and press publishers’ rights in works and data, covering TDM for scientific research, has been included in EU copyright law with the DSM Directive.Footnote 2 The DSM Directive was drafted and enacted before the potential of generative AI was generally recognized. However, with the enactment of the AI Act on May 21, 2024,Footnote 3 it was made clear that, for the use of copyright-protected works in AI training, the AI Act relies on the TDM provisions in the DSM Directive.Footnote 4
This paper discusses the scope and application of Article 3 of the DSM Directive as it applies to LLM training in research institutions. The discussion will also apply to other generative AI using copyright-protected works for training, such as AI tools for generating music or visual art. The paper considers these questions from the perspective of research ethics and principles of open science in publicly financed research. The legal framework considered is European copyright law in the European Union (EU) and European Economic Area (EEA).
The Role of Research Institutions and National Libraries in Training LLMs
Training advanced AI models requires access to vast amounts of data. In natural language processing (NLP), the training materials are texts in the relevant language. National libraries and research libraries are in a unique position to develop innovative and high-quality LLMs due to their access to vast collections of texts and literature. Access to collections and repositories of research articles and materials enables the development of models applying professional language and qualitative research methods. Access to newspapers has proven valuable for training models for sentiment analysis and general knowledge.Footnote 5 Many of these collections are already digitized, although investment may be required for data cleaning and processing for machine-learning purposes. Because of the vast amount of training materials and the complex training required, the costs of developing LLMs for lesser-used national languages may be too high to motivate commercial for-profit developers. For these reasons, there is an important role for research institutions and libraries to play in developing and training LLMs as part of their public interest mission to preserve cultural heritage and national languages.
The public interest mission of research institutions and libraries includes the dissemination of research results for the benefit of society. Research results should be made openly available for private and commercial actors as bases for further research, including for-profit innovation.Footnote 6 An assessment of the legal framework applicable to libraries and research institutions when training LLMs should therefore include the conditions for making the model openly accessible. This adds to the complexity of applying EU copyright law, which is fragmented and mainly focuses on the training phase. Furthermore, principles of research ethics and open science go beyond the narrow exception from copyright for scientific research in Article 3 of the DSM Directive, challenging the scope of that exception.
Scope of the Exception in Article 3 of the DSM Directive
TDM and the Development of AI
Machine-learning systems use algorithms and statistical models to draw inferences from patterns in data to make predictions or decisions without being explicitly programmed to do so.Footnote 7 In generative AI, the model will continue its training and development when applied to new data through self-assessment and feedback methods. The information extracted during training is stored in a separate file and added to the model’s set of “rules,” which it will use to make more accurate predictions when applied to new data.Footnote 8 TDM uses similar statistical and mathematical techniques, but the objective is to present the extracted information. Since the definition of TDM in the DSM Directive only covers the process of extracting information and not the use of the information generated, machine-learning processes for AI will mostly be covered by the definition of TDM in the DSM Directive.Footnote 9
Under Article 3 of the DSM Directive, EU Member States are obliged to provide for an exception to copyrightFootnote 10—that is, copyright, database rights,Footnote 11 and press publishers’ rights.Footnote 12 In theory, the exception should pave the way through the layers of exclusive rights in collections and repositories of works for the purpose of TDM in research. The objective of the exception was to permit usage types that were not covered clearly enough by existing EU rules on exceptions and limitations and to harmonize variations in national legislation regarding TDM for research resulting from the optional character of other exceptions and limitations.Footnote 13
Copyright: The Right of Reproduction
Books, articles, papers, poems, and other literary expressions are subject to copyright under European copyright law.Footnote 14 Only the simplest, most elemental texts will not qualify for protection.Footnote 15 Due to the massive amount of text required to train a language model, it can safely be assumed that access to a large amount of copyright-protected works is necessary.Footnote 16 The introduction of the exceptions for TDM in the DSM Directive confirmed the presumption that machine learning will infringe copyright in the training materials.Footnote 17 To assess the legal framework applicable to publicly financed research institutions, it is useful to take a brief look at which phases of development of a natural language model could infringe EU copyright law.
Under EU copyright law, the “temporary or permanent reproduction by any means and in any form, in whole or in part” is reserved for the copyright holder.Footnote 18 The Court of Justice of the European Union (CJEU) has construed the provision broadly, extending the right to every act of reproduction.Footnote 19 The making of any digital copy of a work is considered an act of reproduction, including copies in the RAM or cache memory of a computer, even if such copies may be intrinsic to (lawfully) accessing a work by computer, such as online browsing.Footnote 20 This very broad and formal construction of the right of reproduction has been criticized for going beyond the fundamentals of copyright.Footnote 21 Also, such use does not interfere with the author-audience nexus of free speech and enlightened human communication at copyright’s core.Footnote 22
Already, the compilation of a centralized training corpus would entail several acts of reproduction infringing Article 2 of the Infosoc Directive. To train an LLM on works in a library collection, texts may have to be digitized or transformed from digital human-readable formats such as PDF into machine-readable formats.Footnote 23 Pre-processing the data (that is, sorting out outliers and other cleaning of the training data, as well as adding metadata and annotations for training) likely requires temporary copying in the computer’s cache memory, which is also covered by the right of reproduction in Article 2 of the Infosoc Directive.Footnote 24
In the training process, the model goes back and forth between the training data, testing, and modifying its “rules,” yielding statistical information about correlations, trends, differences, and the like in the training data.Footnote 25 This information enables a model in basic form to predict the most likely next word and finally produce large amounts of text. The original works are not recognizable in the model’s stored files, even if the model can be prompted to produce text that is identical or very similar to specific training materials. During the training process, it is therefore likely that copies are created, which are considered reproductions under Article 2 of the Infosoc Directive, even if no human-readable copies are made.Footnote 26 The stored files in the model itself, however, do not necessarily include reproductions of works, and when the model is applied to new data or to produce text, it would in most cases not entail the reproduction of training materials. There may be some variation between models.
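The distinction drawn above between the training texts and the statistical information stored in the model can be illustrated with a deliberately simplified sketch. It assumes a toy bigram model (a table counting which word follows which), which is far simpler than an actual LLM; the corpus sentences and the function names `train_bigram_model`, `predict_next`, and `generate` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(texts):
    """Count, for each word, which words follow it in the training texts."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if none was seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(model, start, max_words=10):
    """Produce text by repeatedly predicting the most likely next word."""
    words = [start.lower()]
    while len(words) < max_words:
        next_word = predict_next(model, words[-1])
        if next_word is None:
            break
        words.append(next_word)
    return " ".join(words)


# Hypothetical two-sentence training corpus (illustration only).
corpus = [
    "copyright protects original works of authorship",
    "libraries preserve national languages",
]
model = train_bigram_model(corpus)

# The stored "rules" are word-pair counts, not the sentences themselves.
print(dict(model["copyright"]))      # {'protects': 1}
print(generate(model, "copyright"))  # copyright protects original works of authorship
```

In this toy case the stored table holds only word-pair counts, yet generation from the first word returns a training sentence verbatim. This mirrors, under the stated simplifications, the point that a model need not contain recognizable copies of works to be capable of producing output identical to training materials.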
Database Rights
Research institutions and libraries also have access to works through subscriptions to other collections. Collections may be protected databases under Article 7(1) of the Database Directive, protecting databases for which “obtaining, verification or presentation of the contents” requires “a substantial investment.” The sui generis right to databases gives the right holder an exclusive right of extraction as defined in the Database Directive, Article 7(2)(a), as “the permanent or temporary transfer of all or a substantial part of the contents of a database to another medium by any means or in any form.”Footnote 27 Using a protected collection for training AI would most likely infringe this right of extraction, as it also covers the “repeated and systematic” extraction of insubstantial parts of the database.Footnote 28 Whereas searching the database may entail digitally copying contents, this does not infringe the extraction right if the consultation is lawful.Footnote 29
The sui generis right is infringed if the investment in making the database is harmed, which is the case if the right holder is deprived of revenue that should have enabled the holder to recoup the cost of investment.Footnote 30 Machine learning reveals new knowledge and facilitates new and innovative services, activities that would not likely be characterized as “parasitical competing” activities infringing the sui generis right.Footnote 31 The methods used for machine learning, however, make use of the economic value associated with the database containing a large collection of works. It is therefore unlikely that machine learning would be considered a “normal exploitation” of a database under Article 8(2) of the Database Directive if not explicitly included in a license.Footnote 32 For activities that appropriate the value inherent in the database, the CJEU has found it legitimate for the database holder to reserve a fee in consideration for such use.Footnote 33 It therefore seems likely that the CJEU would find machine learning and AI training to be harmful to the investment in the database that the sui generis right protects.Footnote 34
Research institutions and libraries also hold repositories of works, and such repositories could be subject to database rights. Recent EU legislation suggests that, for public institutions, the objective of data transparency would supersede that of protecting database rights in collections of data.Footnote 35 Under Article 1(6) of the Open Data Directive, public sector bodies are prohibited from exercising their database rights to prevent the reuse of documents or to restrict reuse beyond the limits set by the Open Data Directive.Footnote 36 The Open Data Directive applies to universities as well as bodies governed by public law.Footnote 37 It requires that research data be made openly available, defining research data as digital documents “collected or produced in the course of scientific research activities.”Footnote 38 The definition excludes scientific publications. However, as pointed out in the preamble to the Open Data Directive, research output should already be available under open-access policies. Open access is understood as “the practice of providing online access to research outputs free of charge for the end user and without restrictions on use and reuse beyond the possibility to require acknowledgement of authorship.”Footnote 39 For public research institutions, it could be contrary to their public interest mission of open access to research and research output as expressed in the Open Data Directive to invoke sui generis database rights to restrict the use of open-access repositories for training LLMs, even for commercial purposes. Research institutions engaging in the development of AI or LLMs would also be able to use the repositories of other public research institutions to which they have lawful access for natural language processing activities.
Letting the public interest in data transparency prevail over the economic interests of data holders as protected under the sui generis database right is consistent with other recent EU regulations.Footnote 40
Press Publishers’ Rights
The exception in Article 3 also covers the newly introduced press publisher’s right under Article 15 of the DSM Directive. Under Article 15, press publishers are granted full exclusive rights as in Articles 2 and 3(2) of the Infosoc Directive, but only against “the online use of their press publications by information society service providers” and only for a period of two years from publication. An information society service is “any service normally provided for remuneration, at a distance, by electronic means and at the individual request of a recipient of services”—that is, any service provided individually over the internet.Footnote 41 Many research institutions and libraries subscribe to various press publications. While it could be questioned whether the use of press publications by public interest research institutions and libraries for the purpose of scientific research would infringe Article 15, the inclusion of the press publisher’s right in the exception in Article 3 removes any legal uncertainty as to whether these collections are accessible for the development of LLMs within the scope of the exception.
The Condition of “Lawful Access”
The exception in Article 3 of the DSM Directive applies only to works or other subject matter to which the institutions have “lawful access.” This includes the institutions’ own collections, such as repositories of research output and data, as well as digitized collections of lawfully acquired physical copies. Lawful access can be based on a license or a subscription to a collection of works or through an open-access license.Footnote 42 The exception also applies to content that has been made available to the public online without reservation for TDM.Footnote 43 In addition, the exception will cover works that have been donated to the institution as well as works made available to the institution through licensing arrangements required under legislation or as a result of the implementation of limitations to copyright in national law.Footnote 44
The exception covers the research activities of persons “attached” to the organization. In the case of subscriptions, this will also depend on the terms of the subscription agreements.Footnote 45
Finally, the condition of “lawful access” must be distinguished from the concept of “lawful use” in Article 5(1) of the Infosoc Directive. Under this article, a use is lawful if it is either authorized by the right holder or falls outside the scope of the author’s exclusive right.Footnote 46 The CJEU has applied the exception in Article 5(1) to services that have as their output excerpts of the works so small that they do not reproduce the “expression of the intellectual creation of the author.”Footnote 47 A language model trained to produce text will not aim to reproduce the text of the works on which it has been trained but may still happen to do so. For the exceptions in Articles 3 and 4 of the DSM Directive to fulfill their objective of balancing the interests of right holders against those of users,Footnote 48 they should be applied to any acts of reproduction during the development and training of LLMs, but without regard to the output from using the model. Considering the model’s output lawful because the training of the model was lawful would tip the scales. A separate assessment of the output allows for the application of Article 5(1) of the Infosoc Directive when appropriate and a finding of copyright infringement in cases of memorization or the override of paywalls, etc.
Beneficiaries of the Exception in Article 3 of the DSM Directive
“Research organization” is broadly defined in Article 2(1) of the DSM Directive and includes universities, libraries, research institutes, and hospitals that carry out research.Footnote 49 The scope of the exception in Article 3 hinges upon a functional definition of research organizations to allow for the different legal forms and structures of research organizations in EU Member States.Footnote 50 The organization must have as its primary goal to conduct scientific research or to carry out educational activities also involving scientific research.Footnote 51 The organization must be involved in scientific research either “on a not-for-profit basis or by reinvesting all the profits in its scientific research,” or “pursuant to a public interest mission recognized by a Member State” per Article 2(1) of the DSM Directive. A public interest mission will often be reflected in the public funding of universities and their libraries, but it could also be reflected in provisions in national laws or in public contracts.Footnote 52
“Cultural heritage institution” is defined in Article 2(4) as “a publicly accessible library or museum, an archive or a film and audio heritage institution.” It does not matter what types of works or data the institution holds in its permanent collection. The definition also includes educational establishments, research organizations, and public sector broadcasting organizations.Footnote 53
The exception aims to enhance legal certainty to the benefit of the research community but only insofar as the research is conducted in a way that also adheres to the values of independent and open research. Organizations in which commercial undertakings have a decisive influence, allowing such undertakings to exercise control through their shareholders or members in a way that could result in preferential access to the research results, fall outside the definition of research organizations in the directive.Footnote 54 Some private universities and research organizations are owned by foundations that are either not-for-profit or reinvest profits in research. To benefit from the exception, their statutes or boards should be set up with explicit guarantees against preferential access to research results. The directive also encourages public-private partnerships (PPPs) in research and for public research organizations to use private partners to carry out TDM, including using their technological tools.Footnote 55
Using a broad definition of research organizations provides flexibility for large research initiatives that rely on funding from a combination of sources. The DSM Directive does not require that the organizations be established in the EU. As long as an organization is covered by the definition in Article 2(1), the directive covers research activities taking place within the EU, in line with the stated objective to ensure the EU’s competitive position as a research area.Footnote 56 The DSM Directive has been criticized for excluding startups and individual researchers.Footnote 57 Commercial services may serve the important public interest of information integrity, such as fact-checking services.Footnote 58 However, both startups and commercial service development go beyond the purpose of publicly financed scientific research, and they will not always be able to guarantee adherence to principles of independent and open research. When interpreting Article 3, it should be taken into account that it is part of the EU research policy as anchored in the Treaty on the Functioning of the European Union (TFEU), Article 179.Footnote 59 Integral to EU research policy are principles of independent and open research.Footnote 60 Whether the organization is structured and operates in a way that guarantees these principles, especially regarding the dissemination of research results, could guide the interpretation of the exception in Article 3.
Text and Data Mining “For the Purposes of Scientific Research”
The exception in Article 3 of the DSM Directive applies only to TDM “for the purposes of scientific research.” This includes both the natural sciences and human sciences.Footnote 61 Research activities are characterized by their aim to yield new knowledge. There is no clear definition of when research is scientific, but a basic condition would be that the activity adheres to general principles of methodology and ethics relevant to the subject.Footnote 62 For AI development, like research activities in general, ethical research must respect all applicable national, EU, and international laws.Footnote 63 Ethical guidelines for AI development particularly point to intellectual property rights and the protection of personal data.Footnote 64 Furthermore, the training and development of the model would have to comply with principles of fairness, representativity, transparency, and accountability.Footnote 65 In practice, this means that it should be transparent which materials have been used for training and that the researchers should be aware of the impact that different types of training materials could have on the model.Footnote 66 Since it is part of the public interest mission of research institutions to disseminate research results to benefit society, the fact that a language model is developed with the goal of releasing it into society should not exclude the application of the exception in Article 3 of the DSM Directive.
Actions Covered by the Exception in Article 3 of the DSM Directive
The exception in Article 3 is mandatory and cannot be overridden by contract.Footnote 67 The exception is also made without payment to right holders.Footnote 68 This means that Member States may not envisage a compensation requirement when implementing the exception.Footnote 69 However, there is some concern that the prices for subscriptions and licenses will increase or that publishers and database owners will employ licensing strategies with differentiated prices for TDM.Footnote 70 In addition to driving up the cost of public interest research, this may also negatively affect the quality of AI models.Footnote 71 Price and license restrictions may lead institutions to train their models with materials of poorer quality or a smaller data set. This could be problematic when developing AI models in research, as it could challenge research standards and methodology. The general EU law principle of effectiveness could probably be invoked to support enforcement action against contract practices that deprive the Article 3 exception of its effectiveness.Footnote 72 However, such litigation would likely be time-consuming and costly due to the legal as well as factual uncertainty.
Right holders are allowed to apply measures to ensure the security and integrity of networks and databases where the works or data are stored per Article 3(3) of the DSM Directive. Such measures may include Internet Protocol (IP) address validation or user authentication to ensure that only persons having lawful access to the works can access them.Footnote 73 Measures must be proportionate to the risks involved and not exceed what is necessary to ensure the system’s security and integrity.Footnote 74 These measures must not undermine the effective application of the exception.Footnote 75 The DSM Directive is not very specific in terms of what measures might be applied or how to address overly restrictive measures or the circumvention of such measures. Protection measures should be limited to mitigating the security risks connected with the lawful use of the works.Footnote 76 It is likely that protective measures will be defined in best practices or similar measures as encouraged in Article 3(4) of the DSM Directive.
Under Article 3(2), the works may be stored and retained for “the purposes of scientific research.” This includes verification of research results. In practice, this would indicate that downloading and compiling a training corpus specifically for AI training is lawful and that it could be stored for some time, perhaps as long as the research output (that is, the trained AI) is openly accessible and can be used as a basis for further research and therefore needs to be verified. This means that more permanent copying is covered by this exception than by the exception for temporary reproductions in Article 5(1) of the Infosoc Directive.
A further question is whether a training corpus could be reused for developing other models once it is processed, annotated, and stored. It is costly to process a training corpus even if a research institution or library already has access to large collections of works and data. The wording of the DSM Directive does not exclude reusing data if the new research activity is covered by the exception in Article 3. If it would be lawful to process and store the training corpus for the new activity under Article 3, an existing corpus may be reused at least within the same organization. Whether a processed training corpus could be licensed or sold to other research organizations for scientific research covered by Article 3 is doubtful, as licensing or selling a corpus could go beyond the normal exploitation of works and prejudice the interests of holders of database rights or press publishers’ rights in the corpus.Footnote 77 Article 3 would not cover the licensing of a training corpus to commercial actors. Such use would have to be considered under Article 4 of the DSM Directive; see later in this paper for a more in-depth discussion of this topic.
If copies are kept, they should be stored in a secured environment, and the DSM Directive nudges Member States to appoint trusted bodies for managing repositories.Footnote 78 However, any such requirements should be proportionate and not go beyond what is needed for retaining the copies in a safe manner and preventing unauthorized use.Footnote 79
Finally, Article 7(2) of the DSM Directive prescribes that the three-step test in Article 5(5) of the Infosoc Directive shall apply.Footnote 80 Accordingly, the exceptions in the DSM Directive may only apply to special cases that do not conflict with the normal exploitation of the work and do not unreasonably prejudice the right holder’s legitimate interests.Footnote 81 The CJEU has repeatedly stressed that the exceptions from copyright must be interpreted strictly.Footnote 82 The Court has also stated that the conditions for exception must be interpreted so as to “enable the effectiveness of the exception thereby established to be safeguarded and permit observance of the exception’s purpose.”Footnote 83 Copyright should not be detrimental to the development of new technologies, and the Court has emphasized the need to strike a fair balance between the right holder’s interests and the users of works that implement new technologies.Footnote 84 In recent case law, the CJEU has pointed to online media users’ right to information and characterized this right as a fundamental right enshrined in the EU Charter of Fundamental Rights, Article 11.Footnote 85 This means that internet users must be able to trust that website publishers have fulfilled their obligation to obtain sufficient consent to online use from individual right holders.Footnote 86
Applied to Article 3 of the DSM Directive, there is, on the one hand, the public interest in innovation based on access to information that would call for a broad interpretation of the exception. On the other hand, for the public interest in the right to information and information integrity, represented by LLM users, the exception must be interpreted in a way that ensures that the model can be lawfully used without further rights clearance. In theory, this may include users’ interests in the interpretation of Article 3 of the DSM Directive. However, it remains to be seen how the CJEU will approach the three-step test under Article 3, especially since the balancing of interests is more complex with the inclusion of database rights and press publishers’ rights under the exception. These rights protect interests in the investments in the collection, which differ from the right holders’ interests in individual works.Footnote 87
Open Access to Research Output and Article 4 of the DSM Directive
The exception in Article 3 of the DSM Directive is directed at research organizations and libraries carrying out scientific research pursuant to a public interest mission. These institutions follow EU principles on open science and dissemination of research results, which are anchored in the TFEU, Articles 179 through 183. The principles of open science and open-access dissemination of research output are mandatory under the EU Horizon programs,Footnote 88 a practice that national research councils have also adopted. Open science and open dissemination of results are implemented in the strategies of most publicly financed research organizations.Footnote 89 The relevant organizations will also be held accountable for adhering to general principles on research integrity and research ethics.Footnote 90 Specialized integrity principles have been developed for the use and development of AI in the research process.Footnote 91
This section looks at how language models developed by research institutions may be disseminated in line with the above-mentioned principles, while not extending beyond the scope of Article 3 of the DSM Directive. Activities that go beyond the scope of Article 3 must comply with Article 4 of the DSM Directive, most notably that right holders must have a real opportunity to reserve the use of their works in the development of the AI model.
An initial question is whether the model, once trained, could be made openly available for the public to use, including for commercial purposes. For example, a model trained on the works available in a national law library could be very attractive not only to other researchers but to courts, ministries, and commercial players, such as law firms and industry, along with the general public. Regarding the individual works on which the model has been trained, this hinges on whether the model itself contains reproductions of sufficient parts of works.Footnote 92 This comes down to the model’s technical features and whether the file containing the experiences from the training process contains parts of the training materials.Footnote 93 If the model does not contain parts of the training materials amounting to a work’s reproduction, making it openly accessible for public use, including for commercial purposes, should be lawful under Article 3 of the DSM Directive even without complying with the obligation to allow right holders to “opt out” per Article 4 of the DSM Directive. When the lawfulness of the model’s further use hinges on whether a reproduction is made, the model may also be further trained and developed for commercial purposes if this process is compliant with Article 4. “Article 3-materials” may not be used for training for commercial purposes without the right holders having had the opportunity to “opt out.”
For database rights, problems arise for those that are privately held, as public research institutions’ obligations vis-à-vis open access and open data likely prevent them from invoking database rights against commercial developers of AI.Footnote 94
For privately held database rights and press publishers’ rights, infringement hinges upon whether commercial use of the model entails copying to such an extent that right holders’ investments in the production of the content are undermined.Footnote 95 While the CJEU has been less concerned with the amount of copying when assessing infringement of database rights, there is still a basic requirement that the model’s use also entails accessing the collections or that the model has stored sufficient information during training to amount to infringement under Article 7 of the Database Directive or Article 15 of the DSM Directive. This seems unlikely.
The press publisher’s right is limited to two years, and the training of the model could therefore possibly be designed to avoid infringement. Database rights last for fifteen years from the database’s completion, and since many press publishers are likely to have overlapping database rights in their collections,Footnote 96 adherence to Article 4 of the DSM Directive might still be necessary.
Problems related to the model’s output text fall outside the scope of Articles 3 and 4 of the DSM Directive. For the press publisher’s right, the EU Commission has explicitly stated that this right does not extend to facts.Footnote 97 Output from fact-finding services or information service LLMs could therefore present facts from press publications. LLMs may be able to produce text that infringes copyright by “memorization” or by circumventing paywalls. Such output is more naturally assessed individually for copyright infringement. As AI models producing “creative” content become more sophisticated, it may be called into question whether right holders’ interests are unreasonably prejudiced if the model produces text and creative content that might not infringe individual rights but erodes right holders’ economic interests.Footnote 98 These questions point to the importance of ensuring that research involving AI adheres to ethical principles. Under the EU Digital Strategy, the development of trustworthy AI requires that the model is lawful, respecting all applicable laws and regulations, including in the development phase.Footnote 99 Responsible use of AI in research requires respect for rights (that is, intellectual property rights) as well as accountability for the whole research process.Footnote 100
Findings
While standards for ethical research demand respect for intellectual property rights, the discussion in this article has shown that research activities involving the training of language models in compliance with copyright law present new challenges to research ethics and methodology.
First, ethically sound information management must consider technology’s social effects, such as bias, diversity, non-discrimination, and fairness, thereby respecting fundamental rights.Footnote 101 Compiling a training corpus must be done with regard to these standards, an assessment that might be quite complicated.Footnote 102
Second, there is still uncertainty as to whether the DSM Directive will efficiently guarantee access to materials protected by intellectual property rights for training language models in public interest research. Data transparency prerogatives in recent EU legislationFootnote 103 support better access to databases and collections of works. However, the DSM Directive does not foresee specific enforcement action against contractual or commercial practices that restrict access to protected materials for research purposes, such as prohibitively expensive licensing or differentiated subscriptions. While data transparency and accountability are values generally promoted in legislation, regulation is fragmented and serves different objectives, and legal uncertainty remains.
Finally, there is still legal uncertainty regarding the dissemination of results (that is, the use of a model for different societal purposes). When drafting the exceptions for TDM in the DSM Directive, the EU Commission did not consider the public interest mission of public research institutions and libraries (that is, making research output openly available in society). The narrow focus of the exception on the training phase causes legal uncertainty regarding the dissemination phase.Footnote 104 The mere use of a developed model to produce new text rarely requires revisiting and (digitally) copying the training materials. If someone wants to develop and train the model further, a more nuanced assessment is necessary. That a model can be used for further training by commercial undertakings is consistent with the public interest mission of public research institutions but falls outside the scope of the exception in Article 3 of the DSM Directive. The lawfulness of such training depends on whether new (digital) copies of “Article 3-training materials” are made, which again depends on the model’s technical specifications. Retraining the model, including on the original “Article 3-training materials,” would require that the conditions in Article 4 of the DSM Directive are met, notably that right holders have been given the opportunity to “opt out.” It could be questioned whether tying open science to the choice of machine-learning methods could lead to an imbalance in the weighing of interests when methods for designing and training AI models are chosen. If copyright law is given too much weight, methodologies may be chosen that have less consideration for other important values, such as bias, non-discrimination, and fairness.