*This article is based on a lecture given at the Macnica Data・AI Forum 2024 Autumn held in October 2024.
With the remarkable development of IT, the opportunities to handle data have increased dramatically, but at the same time, data management has become more important than ever before.
In the first half of this article, Makoto Uchida, a lawyer and patent attorney at iCraft Law Office, explains the legal nature of data and the contract clauses and pitfalls to watch for when concluding data contracts. In the second half, he covers the points to note and the legal issues that arise when using services built on generative AI.
Legal issues regarding the use of data
This lecture has two main themes. I will start with the first: "Legal issues regarding the use of data."
There is no concept of "my data"
Consider a case in which someone's wallet has been stolen. The person who had the wallet owns it, so on the basis of ownership they can demand that the thief return it, or demand that the thief "stop using it in that way." Ownership, however, applies only to tangible objects, so data, which is intangible, cannot be owned. Claims based on ownership such as "give me my data back" or "stop using my data in that way" are therefore not available. This is the legal nature of data under a correct understanding of civil law.
For example, suppose that Company A stores data on its server and Company B asks to use that data. Even if Company B accesses the server and uses the data in a way that Company A did not intend, Company A does not own the data and therefore cannot tell Company B, "We do not allow you to use it that way."
The exception is where unintended use of the data is restricted by contract or other means. Put the other way around, Company A has no choice but to impose the restriction by contract. It may seem natural to say, "Please stop using my data without permission," on the assumption that you own the data, but in reality that demand may not be accepted, and relying on it can lead to a serious failure. This is a very important point.
Some people reason that if data can be protected by intellectual property law, which grants exclusive rights by statute, then injunctive relief should also be possible. However, copyright protects only works that are creative. The test for creativity is whether the individuality of the person who made the content is expressed in it. Data that is merely a string of numbers, for example, is considered to have no creativity.
Next is patent rights. Data itself is not an invention, so in principle it cannot be protected by a patent. What is called a data structure, which produces one result in response to one instruction, can be protected as a quasi-program. As the name "quasi-program" suggests, though, a data structure in this sense does not cover the kind of data you might imagine. Nor can data itself be protected by design rights, which cover designs.
The only exceptions are trade secrets and limited-use data under the Unfair Competition Prevention Act. However, if Company A discloses a trade secret to Company B, the information does not qualify as a trade secret unless the two companies have signed a contract containing a confidentiality obligation. This is because a trade secret must satisfy three requirements: confidentiality, non-publicity, and usefulness. Unless provision to third parties and use for purposes other than those intended are prohibited, the data is not being kept confidential and will not satisfy the non-publicity requirement either. In other words, it is a mistake to think that "a contract is not necessary because data can be protected under the Unfair Competition Prevention Act."
It is very rare for data to be protected by copyright; the exception is data that has creative qualities. Such data includes music, text of more than minimal length, photographs, and paintings. Being music does not automatically make something creative, however, and the presence or absence of creativity must be judged case by case.
Another question that is often asked is, "Aren't databases considered works of authorship?" It is important to understand that "the copyright in a database and the copyright in the individual data contained within it are separate things," and that "just because a database is a work of authorship does not mean that the individual data are protected."
Differences in the meaning of licenses
Here are some common misconceptions among people who are accustomed to dealing with intellectual property rights. Intellectual property rights are comprehensive rights of prohibition established by law. In the case of a patent, for example, only the patent holder may implement the patented invention, and implementation by anyone else is prohibited. Licensing an intellectual property right can therefore be thought of as lifting this comprehensive prohibition: the license opens a hole in the area covered by the right.
In the case of data, there is no ownership and, unlike intellectual property rights, no comprehensive right of prohibition based on the law. For anyone who has access to the data, the default is therefore "feel free to use it." A data license consequently works in the opposite direction to an intellectual property license: instead of lifting a comprehensive prohibition, it "creates" prohibitions by contract for the acts the data provider wants to prevent. Specifically, it restricts use by listing prohibited acts such as duplication, provision to a third party, and use for purposes other than those intended.
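The difference in direction between the two kinds of license can be illustrated with a small sketch. This is only a simplified model for intuition, not a statement of law: the class names `IpLicense` and `DataLicense` and the act labels are hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IpLicense:
    """Statutory exclusive right: every act is prohibited unless licensed."""
    permitted_acts: frozenset = frozenset()

    def allows(self, act: str) -> bool:
        # Default is "prohibited"; the license opens holes in that prohibition.
        return act in self.permitted_acts


@dataclass(frozen=True)
class DataLicense:
    """No statutory prohibition: every act is allowed unless the contract forbids it."""
    prohibited_acts: frozenset = frozenset()

    def allows(self, act: str) -> bool:
        # Default is "free to use"; the contract creates the prohibitions.
        return act not in self.prohibited_acts


# Hypothetical act labels, chosen only for illustration.
patent_license = IpLicense(frozenset({"manufacture"}))
data_license = DataLicense(
    frozenset({"duplication", "provision_to_third_party", "use_beyond_purpose"})
)

print(patent_license.allows("sale"))        # False: not licensed, so still prohibited
print(data_license.allows("analysis"))      # True: not listed, so not prohibited
print(DataLicense().allows("duplication"))  # True: with no contract terms, nothing is prohibited
```

The last line is the practical point: with an empty data contract, nothing is restricted, so every act the provider cares about has to be written down.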
What data usage conditions should be set?
From here, we will look at how to define the conditions and prohibitions on data use. In a data contract, it is common to draft clauses that distinguish the target data from the derived data produced by processing it.
The contract terms will stipulate provisions such as prohibiting use for purposes other than those intended and prohibiting provision to third parties, but there is one point of caution.
The data provider supplies the target data to the data user, and the data user processes it to create derived data. However, the data provider is not automatically entitled to have the derived data disclosed to it unless the contract says so.
This is another consequence of the lack of ownership of data.
Here is another example where misunderstandings are likely to occur. Company A provided data to Company B under a contract. Company B used that data to create its own trained model. Company A then said, "Since you used our data, all or part of the rights (copyright, etc.) in the trained model should belong to us." However, the contract says nothing about this.
In this case, Company A has no rights in the trained model. Handing over training data does not mean, for example, that Company A made a creative contribution to the model's program, so no copyright in the program arises for Company A. This only means that Company A cannot assert rights in the trained model by operation of law; there is no problem with providing for such rights in the contract. Legally, that amounts to the copyright and other rights that first vest in Company B being transferred, in whole or in part, to Company A by contract.
Legal Issues Regarding Generative AI
From here, I will talk about the second topic, generative AI, focusing on copyright issues.
Piracy by generative AI
Let's look at a case that has become a problem in society. First, let's say artist A paints painting A. If painting A is creative, it is naturally a copyrighted work of artist A. However, if artist A publishes painting A on the internet, an AI developer could collect that data, copy it without permission, and train the model. The first point of contention is whether or not the act of copying at the training stage constitutes copyright infringement.
An AI developer or another business uses the trained model to provide a service. Then, when user X uses the service, he or she sends a prompt such as "Draw a picture in the style of artist A," and the model draws painting B accordingly. If the content of painting A and painting B are similar at this usage (output) stage, this could also be copyright infringement. This is the second point of contention.
After that, let's say that copyright infringement is found for Painting B. Who is responsible? Certainly, it is User X who causes the output action by prompting. However, it can also be said that "copyright infringement occurs because this model exists," in which case the infringer is the AI developer or the service provider. In other words, the third point of contention is, "Who is the perpetrator of copyright infringement at the usage (output) stage?"
Below, we will discuss these three points separately.
Copyright infringement during the learning phase (point 1)
Whether copying or adapting works such as paintings or music into a learning dataset and training a model on it constitutes copyright infringement is determined under Article 30-4 of the Copyright Act. If three requirements are met (1. the purpose is not to enjoy the thoughts or sentiments expressed in the work, 2. the use stays within the extent deemed necessary, and 3. the interests of the copyright holder are not unduly harmed), there is no copyright infringement.
The "non-enjoyment purpose" in requirement 1 is determined by whether the act is directed at obtaining the benefit of satisfying the intellectual or emotional needs of the viewer or other person through viewing or listening to the work.
In the case of model learning, the computer extracts features and turns them into parameters, and of course feelings such as "this copyrighted work is great" do not enter into the process. In other words, training on a copyrighted work can itself be said to be for a non-enjoyment purpose, so requirement 1 is met, which weighs against infringement.
In addition, Article 30-4 of the Copyright Act lists information analysis as an example of a non-enjoyment purpose. Since creating learning data is in principle a form of information analysis, it normally satisfies requirement 1.
On the other hand, there are some cases where requirement 1 is not met (where a purpose of enjoyment is found to exist).
One of these is "additional learning with the purpose of intentionally outputting a creative expression of a copyrighted work contained in the training data." For example, if the training data contains a painting by artist A, manipulating the parameters so that this specific painting is output, or preparing training data for that purpose, has a purpose of enjoyment and may be considered copyright infringement.
The other is "learning aimed at outputting products that directly convey the creative expression of a specific work." For example, if you want to output pictures of a popular character, you might have the machine learn only pictures of that character.
Generally speaking, general-purpose models that are not tuned to output a specific artist's work are not regarded as having a purpose of enjoyment, and training them is likely to be deemed not to be copyright infringement. Note that requirement 2 concerns only necessity, so we will not discuss it here.
Let me explain requirement 3, "the interests of the copyright holder are not unduly harmed." For example, an AI developer trains a model using a painting by artist A. After that, a user inputs a prompt into the model, and the model outputs a painting that artist A would be likely to draw. In other words, the training carried out at the learning stage is an act that may lead to a painting like artist A's being output in the future. From artist A's perspective, there is a risk that his or her paintings will no longer sell, so there is a view that this act of learning may unduly harm the interests of the copyright holder (artist A). A relatively large number of researchers hold this view.
There are various opinions on such cases, but whether a picture like artist A's is output depends on the prompt, and it should not be possible to know at the learning stage whether the copyright holder's rights will be infringed in the future. Therefore, I do not believe that copying or adapting at the learning stage in itself unduly harms the copyright holder's interests merely because infringement might occur at a later use stage. If artist A were to sue for copyright infringement, my view is that he or she should do so at the stage when something similar to the painting is output in response to a prompt, or when that output is used.
Copyright infringement at the use (output) stage (point 2)
So much for the learning stage; we now turn to the use stage. Copyright infringement is established when the following requirements are met:
- ① The plaintiff is the copyright owner
- ② Copyrightability
- ③ Identity or similarity
- ④ A statutory act of use (copying, adaptation, etc.) has been performed
  - (1) Reliance
  - (2) Reproduction
Of these, the most important issue for copyright infringement by generative AI is "reliance," listed under ④(1). Reliance means "coming into contact with someone else's work and using it in your own work." In plain terms, it means looking at something and imitating it.
However, since it is difficult to judge whether someone has actually "seen" something, the criterion often used is whether the work could have been accessed. If the copyrighted work in question is included in the training data, access (and therefore reliance) may be found. Another criterion is whether the user who entered the prompt was aware of the existence of the copyrighted work.
Under the current thinking of the Agency for Cultural Affairs, if the user is aware of the existence of the copyrighted work, reliance is found even if the work is not included in the learning data. Of course, this is not a court's view, so it is not settled, but I am not comfortable with it. This is because even if the user inputs a prompt such as "Draw a picture in the style of artist A," that instruction has no effect on the output at all if the model has not learned artist A's pictures.
However, if the user supplies a painting by artist A as a model (for example, by including the painting in the prompt, or by instructing the AI to search for a painting by artist A on the Internet), or if the prompt contains an instruction that names a specific title, such as "Draw a painting similar to artist A's work titled XX," then reliance should be found.
Furthermore, even if the training data contains the copyrighted work, reliance is considered absent if measures are taken at the generation stage to prevent output that is identical or similar to a specific image, such as the generative AI returning a message saying "that image cannot be generated."
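A generation-stage measure of the kind described above can be sketched as a filter that sits between the model and the user. The sketch below is only an assumption-laden illustration: it compares text rather than images, uses `difflib.SequenceMatcher` as a stand-in similarity measure, and the function name `guard_output`, the refusal message, and the threshold of 0.8 are all hypothetical. A numerical similarity score is not the legal test of similarity.

```python
from difflib import SequenceMatcher

# Hypothetical refusal text returned instead of a blocked output.
REFUSAL_MESSAGE = "That content cannot be generated."


def guard_output(candidate: str, protected_works: list[str], threshold: float = 0.8) -> str:
    """Return the candidate output, or a refusal if it is too close to a protected work.

    `protected_works` is an assumed list of works the provider wants to keep
    out of the output; `threshold` is an arbitrary illustrative cut-off.
    """
    for work in protected_works:
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        if SequenceMatcher(None, candidate, work).ratio() >= threshold:
            return REFUSAL_MESSAGE
    return candidate
```

For image generation, the string comparison would have to be replaced with an image-similarity method; which method is appropriate is outside the scope of the lecture.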
Copyright infringers at the use (output) stage (point 3)
On the third point of contention, "Who committed the infringement?", opinions are divided between the user and the AI service provider (AI developer).
The output from the model varies greatly depending on the content of the prompt. Since it is the user who writes the prompt, I believe that in principle the user should be regarded as the infringer. However, a service provider that knows its service frequently produces infringing output and has taken no particular countermeasures should also be held responsible for the infringement. On the other hand, there is also the view, drawing on the Rokuraku II case, that the AI service provider should be regarded as the party committing the infringement when the question of who the infringer is gets evaluated normatively.
Internal rules for using generative AI
When creating rules for the use of generative AI within a company, it is very important to ensure that they do not end up as pie in the sky. It is easy to say, "Let's check carefully to make sure we don't infringe anyone's rights," but leaving the decision on whether rights are infringed to the discretion of an individual employee on the ground is completely ineffective.
It is also important to verify the accuracy of the content and to require employees to report when they have used generative AI. If employees who use generative AI in their work do not feel able to report it to their superiors, the number of deliverables for which it is unclear whether generative AI was used will grow. Checkers will then have to be overly cautious about everything, and the checking mechanism will no longer function properly.
The use of generative AI may or may not infringe on rights depending on the content of the prompt. Therefore, we must not forget to "save the prompt content and output (log storage) in case of future disputes." After talking to various companies, I feel that this is not an easy thing to do, but I think it is something that needs to be thoroughly implemented in the future.
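As a rough illustration of what log storage could look like, the following minimal sketch appends each prompt and output to a JSON Lines file. It assumes text output and a local file; the function names `log_generation` and `read_log`, the record fields, and the file layout are all hypothetical choices, not something prescribed in the lecture.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def log_generation(log_path: Path, user_id: str, prompt: str, output: str) -> dict:
    """Append one prompt/output pair to a JSON Lines log and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the output was generated
        "user_id": user_id,                                    # who entered the prompt
        "prompt": prompt,
        "output": output,
        # Hash of the output, so a later deliverable can be matched against the log.
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_log(log_path: Path) -> list[dict]:
    """Load all stored records, one per non-empty line."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

In practice, retention periods, access control, and tamper resistance would also need to be decided; for image output, the hash could be computed over the image bytes instead of text.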

iCraft Law Office Lawyer and Patent Attorney
Mr. Makoto Uchida
He specializes in building intellectual property strategies related to AI and IT, and building legal strategies for personal information and data businesses. He is a member of the working group for the Ministry of Economy, Trade and Industry's "AI and Data Contract Guidelines Review Committee." He has been selected twice in the intellectual property category of the lawyer rankings conducted by the weekly Toyo Keizai magazine.