An analysis of the legal risks of generative-AI data training


The first copyright infringement case involving an AI-generated drawing was decided recently, triggering another round of heated discussion in academia and industry about the copyrightability of AI-generated content. Beyond this unresolved issue, whether the data training phase of generative AI infringes rights holders’ interests is also controversial. This chapter summarises the topic.

What are the legal risks in generative AI data training?

Article 7 of the Interim Measures for the Administration of Generative Artificial Intelligence Services lists the relevant specific requirements for the training data of generative AI service providers as follows:

  1. use data and base models with legitimate sources;
  2. where intellectual property rights are involved, the intellectual property rights enjoyed by others in accordance with the law shall not be infringed;
  3. where personal information is involved, the consent of the individual shall be obtained, or the processing shall fall within other circumstances stipulated by laws and administrative regulations;
  4. effective measures shall be taken to improve the quality of training data and enhance the authenticity, accuracy, objectivity and diversity of training data;
  5. comply with other relevant provisions of laws and administrative regulations such as the Cybersecurity Law of the People’s Republic of China, the Data Security Law of the People’s Republic of China and the Personal Information Protection Law of the People’s Republic of China, and with other supervisory requirements of the relevant authorities.

The following sections will elaborate on items (1) to (3).

Using data with legitimate sources

In practice, the illegitimacy of a source is mostly manifested through the improper obtaining of data, such as by means of a database breach. This constitutes unfair competition and is regulated by the Anti-Unfair Competition Law. The relevant jurisprudence is laid out in the following table of unfair competition disputes.

Case number: (2017) Yue 03 Min Chu No. 822
Parties: Shenzhen Gumi Technology Co, Ltd and Wuhan Yuanguang Technology Co, Ltd and others
Key points of the decision: The defendant, Yuanguang, used Python-based technology to obtain and use the real-time bus information data of the plaintiff’s ‘Coobus’ software. This was ‘reaping without sowing’ or ‘stealing others’ work’, and an illegal occupation of another’s intangible property rights and interests. It undermined the other party’s competitive advantage, was done with the subjective intent of gaining a competitive advantage for itself, violated the principles of honesty and good faith, and disrupted the order of competition.
Applicable law: Article 2 of the former Anti-Unfair Competition Law

Case number: (2018) Zhe 01 Min Zhong No. 7312
Parties: Taobao (China) Software Co Ltd and Anhui Meijing Information Technology Co Ltd
Key points of the decision: Meijing violated business ethics and the good-faith principle by unlawfully inducing users of Taobao’s business intelligence product to share their accounts. It thereby improperly obtained data that Taobao had invested considerable manpower and resources to collect, research and develop, and then distributed that data for profit. Its behaviour disturbed the order of market competition and damaged the legitimate rights and interests of Taobao.
Applicable law: Article 2 of the Anti-Unfair Competition Law

Case number: (2018) Zhe 8601 Min Chu No. 956
Parties: Hangzhou Zhizhang Technology Co, Ltd, Hangzhou Lidao Technology Co, Ltd et al. and Zhejiang Zhongfu Network Technology Co
Key points of the decision: The dealer database in question had a positive effect. After obtaining the data of the dealers concerned by the improper means of a database breach, Zhejiang Zhongfu homogenised the services provided by the two websites concerned, with the subjective intention of ‘free-riding’ and ‘reaping without sowing’.
Applicable law: Article 2 of the Anti-Unfair Competition Law

Case number: (2020) Zhe 01 Min Zhong No. 5889
Parties: Shenzhen Tencent Computer System Co, Ltd and Tencent Technology (Shenzhen) Co, Ltd v Zhejiang Soudao Network Technology Co, Ltd and Hangzhou Juketong Technology Co
Key points of the decision:

The data controlled by a network operator is divided into raw data and derivative data. For a single item of raw data, the data controller can rely only on the network user’s information rights and interests, and enjoys a limited right to use that data under its agreement with the user. For the data resources as a whole, aggregated from individual items of raw data, the data controller enjoys competitive rights and interests.

The unauthorised use of a single item of raw data controlled by others, where it does not violate the principles of ‘legitimacy, necessity, and consent of users’, should not generally be recognised as unfair competition. The unauthorised, destructive use of data resources controlled by others on a large scale can be recognised as unfair competition.

Unauthorised innovative competition based on others’ existing data resources should comply with the principles of ‘legitimacy, proportionality, consent of users and efficiency’. If a so-called ‘innovative competitive result’ does more harm than good in terms of its effects on market competition, it should be recognised as unfair.

Applicable law: Article 2 of the Anti-Unfair Competition Law

Some scholars suggest that Articles 49 and 53 of the Copyright Law address lawful means of access. However, the third paragraph of Article 49 provides that:

the technical measures in this Law refer to effective technologies, devices or components used to prevent or restrict the browsing and enjoyment of works, performances, sound and video recordings without the permission of the right holder, or to make works, performances, sound and video recordings available to the public through information networks.

Because most cases related to generative AI service providers do not involve the provision of relevant works as is, it appears that Article 49 may not apply.

Situations involving intellectual property rights

This requirement stems from Article 7 item (2): ‘Where intellectual property rights are involved, the intellectual property rights enjoyed by others in accordance with law shall not be infringed.’ The data training phase of generative AI usually involves data mining, and the process of digitising non-electronic data may infringe the right of reproduction. This is especially true in the case of permanent reproduction.

There is currently no litigation in China concerning fair use in the context of generative AI. The closest analogue is Wang Xin v Google. There, the court of first instance held that full-text copying constituted copying within the meaning of the Copyright Law, and that ‘the act of full-text copying has conflicted with the normal utilisation of the plaintiff’s work, and will unreasonably damage the legitimate interests of the copyright owner, and this act of copying does not constitute a fair use act, and has constituted an infringement of the plaintiff’s copyright’. The court of second instance upheld the original judgment but referred to the United States’ ‘four elements’ method for fair use, stating that ‘although unauthorised copying constitutes infringement in principle, copying specifically for the purpose of fair use should be viewed in conjunction with the subsequent use of the act, which may also constitute a fair use’. At the same time, it stated that ‘the determination of fair use except for the specific circumstances stipulated in Article 22 of the Copyright Law should be strictly controlled’. In this case, Google submitted no evidence on whether the copying constituted fair use, so its claim that the copying itself was fair use lacked evidentiary support. The courts of first and second instance thus differed slightly in their approach to determining fair use.

Article 24 of China’s Copyright Law enumerates 12 specific circumstances of fair use, followed by an ‘other circumstances’ provision. Generative AI data training is difficult to fit into any of the 12 specific circumstances, but the 13th item leaves room for a fair use determination. The ‘four elements’ method of judgment is also mentioned in Article 8 of the Opinions of the Supreme People’s Court on Giving Full Play to the Functions of Intellectual Property Judgement to Promote the Great Development and Prosperity of Socialist Culture and Promote the Independent and Coordinated Development of the Economy.

However, several scholars have observed that the legislation does not make specific provisions for this new situation. They note that this lack of legal clarity may lead to a series of shortcomings, as follows:

Courts appear to go beyond the provisions of the Copyright Law and often mix the ‘three-step test’ and the ‘four-element method’ in their judgments, so that the outcome is often unpredictable.

Such intentional omission of any clarification of the type of fair use in adjudication poses a significant risk in terms of legality. At a time when the AI industry is developing rapidly, more and more cases involving the use of works are likely to arise. If legislation does not clearly define the nature of AI deep learning, a large number of lawsuits may follow, which would not be conducive to the healthy development of the Internet industry.

Hybrid standards of determination and overly arbitrary transplantation of foreign concepts appear frequently in domestic adjudication, and the debate over the flexibility and stability of copyright exceptions in different jurisdictions remains unsettled. It is therefore not surprising that the judicial standard for determining fair use in China differs from court to court.

Some scholars therefore advocate that generative AI data training be included in the category of fair use, with the Copyright Law amended to remove the obstacles to AI data training. For example, Xu Xiaoben divides the value of data into its original value and the value of the knowledge added after analysis, arguing that machine learning does not involve the original value of the work and that ‘people will not evaluate the value of the process of AI deep learning itself, but only after outputting the content can they judge whether there is any value’. As for the value of the knowledge added by machine learning, Xu argues that machine analysis does not present the original work as it is and that the value-added knowledge it creates is independent of the original value of the work. Because this value-added knowledge does not affect the original value of the work or its market interest, the copyright owner has no legitimate basis for controlling the AI’s use or sharing in its value-added benefits. On this view, the deep learning behaviour of AI can be treated as fair use under the copyright system.

Similarly, Jiao Heping distinguishes uses of works according to the dichotomy of ‘expressive use’ and ‘non-expressive use’. He believes that ‘non-expressive use’ can be raised as ‘transformative use’ in a defence, whereas ‘expressive use’ still faces the risk of infringement. In terms of value considerations, however, he argues that the system should respond by incorporating the use of AI data into the category of fair use. Lin Xiuqin proposed that ‘the “author-centrism” and strict “three-step test” of traditional copyright law cannot adapt to the needs of AI technological changes. In order to promote innovation and the development of AI technology, fair use should be expanded and the system should be reshaped.’ Liu Youhua mentions that ‘the harsh protection model of the copyright system will limit the development of machine learning technology’. At the same time, ‘the lax copyright protection model will inhibit the enthusiasm of authors to create’, and ‘At present, it is not appropriate to completely exclude machine learning from the fair use system, nor can it be completely included, but should be specifically analyzed for the specific circumstances of machine learning.’ Specifically, a distinction should be made between commercial and non-commercial uses.

Similarly, some scholars have affirmed judicial discretion. Cong Lixian and others believe that relying on the catch-all (‘other circumstances’) clause of fair use is the more feasible solution through the judicial path. However, because fair use is a limitation on rights, it should not be excessively open. In their view, the more feasible practice is to apply both the three-step test and the United States’ four-element rule in a comprehensive judgment of each case.

Circumstances involving personal information

With regard to item (3), a typical case to which generative AI service providers can refer is ‘Sina-Maimai’, outlined below. This was a case of unfair competition concerning the illegal capturing and use of Weibo user information. It established the ‘principle of triple authorisation’.

Case number: (2016) Jing 73 Min Zhong 588
Parties: Beijing Taoyou Tianxia Technology Co, Ltd et al. and Beijing Weimengchuangke Network Technology Co, Ltd
Key points of the decision: In the Open API development cooperation model, the precondition for the data provider to open data to the third party is that the data provider obtains the user’s consent. At the same time, the third-party platform should clearly inform the user of the purpose, manner and scope of the use of the user’s information and obtain the user’s consent again. In the Open API development cooperation model, the third party should therefore adhere to the triple authorisation principle of ‘user authorisation’ plus ‘platform authorisation’ plus ‘user authorisation’ when obtaining user information through Open API.
Applicable law: Article 29 of the Law on the Protection of Consumer Rights and Interests of the People’s Republic of China; Article 2 of the Decision on Strengthening the Protection of Network Information

This case was named one of the top 10 instances of judicial protection of intellectual property rights in the Beijing courts in 2016, and it has influenced the adjudication of many subsequent similar cases. Article 23 of the Personal Information Protection Law responds to the principle of triple authorisation:

Where a processor of personal information provides personal information it has processed to other processors of personal information, it shall inform the individual of the recipient’s name, contact information, purpose of processing, method of processing and type of personal information, and shall obtain the individual’s separate consent. The recipient shall process the personal information within the scope of the aforementioned purpose of processing, method of processing and type of personal information. If the recipient changes the original purpose or method of processing, it shall obtain the consent of the individual again in accordance with the provisions of this Law.

Academia and industry hold differing views on the principle of triple authorisation. On the positive side, Xue Jun believes that triple authorisation better balances the interests of all parties and is ‘of guiding significance for the future protection of personal information and the healthy development of the data and information industry in China’. On the opposing side, Xu Juan, for example, analyses the decision-making of enterprises under a game equilibrium model and finds that the principle of triple authorisation ‘does not conform to the benefit decision-making model’; furthermore, it ‘is not conducive to technological innovation, and there is also the suspicion of pseudo-privacy protection, and the decision-making is not based on the effect of strong market protection’. Taking a middle position, Xu Wei, for example, believes that the triple authorisation principle should not be applied universally to all data types: data involving personal information should be divided into identifiable raw data and non-identifiable derivative data, with different rules adopted for each.

Conclusion

The rapid development of emerging technologies such as generative AI has brought a series of challenges to the traditional legal system and has given rise to many different views in both academia and industry. The Interim Measures for the Administration of Generative Artificial Intelligence Services is the latest achievement of China’s legislation in this emerging field, reflecting China’s continued advancement of regulatory strategies for the development of new technologies and applications. Article 7 of the Measures provides clear guidelines on data training for generative AI service providers. The relevant legal system may be further improved in the future, and the interpretation of the relevant rules may be further clarified and made more specific. Relevant parties should pay close attention to these developments.

