September 5, 2025

Law in the AI Age – Should AI Training be Fair Play under Copyright Law?

This article is part 3 of our series about ‘Law in the AI age’, where we explore the unauthorised use of copyright works for training purposes by genAI models, and whether the Copyright Act should be amended to specifically allow this.

Part 1 and part 2 of our series explore the legal liability for AI advice, and how users of generative AI platforms cannot use legal professional privilege to prevent disclosure in Court of their online discussions.

Background

In 2024, the New York Times estimated that large language models (LLMs) like ChatGPT and Gemini AI generate 100 billion words per day.[1] A year on, this number has grown dramatically. Based on assumptions about usage and output ratios, ChatGPT-5 alone is now estimated to generate 800 billion words per day.

As LLMs become increasingly embedded in the fabric of our daily lives, more and more attention is being paid to the practice of LLMs being trained on copyrighted works.

In this article, we consider the use of copyright works by LLMs for training purposes. Does AI training on copyright works infringe Australian copyright law and, if so, is it in the public interest for the Copyright Act to be amended to permit this?

What is copyright infringement?

The Requirement of Independent Intellectual Effort

Under the Australian Copyright Act 1968 (Cth) (Copyright Act), copyright arises in original literary, dramatic, musical and artistic works that have been reduced to material form, provided that there is sufficient independent intellectual effort by a human creator.

The copyright owner has the exclusive rights set out in section 31 of the Copyright Act to use the copyright material in certain ways and control the use of it by anyone else.  Different exclusive rights apply to different subject matter. Depending on the subject matter, some of the exclusive rights include:

  • The right to control the use and reproduction of the material;
  • The right to put the material online;
  • The right to make copies available to the public for the first time; and
  • The right to perform the material in public.

A copyright owner has the right to license others to use their copyright, or to transfer or assign it.

Copyright infringement occurs when a work is used in one of the exclusive ways controlled by the copyright owner, without the copyright owner’s permission.[2] This includes where only a ‘substantial part’ of the material is used. However, there are certain exceptions under the Copyright Act where copying is permitted.

Fair Dealing Exceptions

Under Australian law, there are currently statutory fair dealing exceptions for:

  • Research or study;
  • Criticism or review;
  • Parody or satire;
  • Reporting the news;
  • Providing professional legal advice; and
  • Libraries, educational institutions and government.

Provided that a user comes within a fair dealing exception, they will not breach copyright.

Does AI Training breach copyright?

What is AI Training?

In order to answer a vast multitude of complex questions in informative and sophisticated ways, generative AI platforms are ‘force-fed’ a huge dataset of tokenised text so they can learn statistical patterns or ‘parameters’. This is what we mean by ‘training’.

For GPT-5–class models, data inputs in the terabytes are required, representing decades or centuries’ worth of human writing. The input dataset is a mix of publicly available text, licensed content, and data created by human trainers. As we discussed in our earlier article, ‘The Rise of the AI Deities: As Thorny Issues Cluster’, this includes large volumes of copyright-protected materials which are fed into the model without the knowledge or consent of IP owners or data subjects.

In the training process, data inputs from copyright works are tokenised into numbers that represent words or sub-words. For example: “The cat sat on the mat” → [523, 1442, 781, 399, 523, 1121]. Instead of memorising whole works, LLMs learn probabilistic relationships – how likely one token is to follow another, how words combine into grammar and how concepts are linked across contexts (e.g. photosynthesis appears with plants and sunlight).
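To make the tokenisation step concrete, the following is a toy sketch only. It assumes a simple whitespace tokeniser and a made-up vocabulary, so the IDs it produces will not match the illustrative numbers above, and it estimates ‘how likely one token is to follow another’ by counting adjacent pairs. Real LLMs use sub-word tokenisers and neural networks rather than raw counts, and the function names `build_vocab`, `encode` and `next_token_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict


def build_vocab(text):
    """Assign each distinct lower-cased word an integer ID, in order of first appearance."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn text into a list of token IDs using the vocabulary."""
    return [vocab[word] for word in text.lower().split()]


def next_token_probabilities(token_ids):
    """Estimate P(next token | current token) from adjacent-pair counts."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(token_ids, token_ids[1:]):
        pair_counts[current][following] += 1
    return {
        current: {tok: n / sum(counts.values()) for tok, n in counts.items()}
        for current, counts in pair_counts.items()
    }


sample = "the cat sat on the mat"
vocab = build_vocab(sample)
token_ids = encode(sample, vocab)
probs = next_token_probabilities(token_ids)

print(token_ids)            # [0, 1, 2, 3, 0, 4]
print(probs[vocab["the"]])  # {1: 0.5, 4: 0.5}
```

In this tiny sample, ‘the’ is followed by ‘cat’ half the time and ‘mat’ half the time. What the sketch retains is a table of numbers describing those patterns, although the source text still had to be read and processed to produce it.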

Today, LLMs have billions of parameters, so they need billions or trillions of tokens to train effectively. Parameters are weights and biases inside the neural network that control how input text is transformed into output text. Parameters store the learned statistical relationships between words, concepts, and patterns. OpenAI’s GPT-4 is estimated to draw on 1.76 trillion ‘parameters’ to perform its tasks[3], though OpenAI has not confirmed that figure. The newer GPT-5 likely uses significantly more parameters.
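As a rough illustration of what a ‘parameter’ is, the sketch below counts the weights and biases in a small fully connected network. The layer sizes are hypothetical and chosen only to show the arithmetic; this is not a description of the architecture of GPT-4 or any other real model.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    Each layer connecting n_in inputs to n_out outputs has
    n_in * n_out weights plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


# Hypothetical toy network: 512 inputs -> 2048 hidden units -> 512 outputs
toy_layers = [512, 2048, 512]
print(count_parameters(toy_layers))  # 2099712
```

Even this toy network has about 2.1 million parameters. The unconfirmed 1.76 trillion estimate for GPT-4 is more than 800,000 times larger.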

Why AI Training breaches copyright

We first wrote about whether the outputs from LLMs attract copyright protection in our article ‘ChatGPT – Challenges to Ownership and Protection of IP in the Era of the AI Deities’. However, a separate question is whether the LLM training process (to generate the user-facing outputs) breaches copyright. When LLMs train on copyright material, they usually do so without a licence from the copyright owner.

In the training process, the data inputs are ingested and tokenised. For example: “Mr. and Mrs. Dursley of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much”. This famous opening line from Harry Potter and the Philosopher’s Stone is still protected by Australian copyright, which for literary works generally lasts for 70 years after the author’s death. That is true even though this 22-word sentence, out of 76,944 words, comprises only 0.03% of the content of this famous literary work. Yet copyright can arise in this sentence on the basis that it is an original expression that captures an essential and distinctive aspect of the work.[4] Very short, ordinary phrases (“she closed the door,” “it was a cloudy day”) usually are not original enough to qualify.

It is interesting that if you ask ChatGPT-5 to repeat the famous opening line from Harry Potter it will not do so, citing copyright concerns: “I can’t generate or supply copyrighted sentences word-for-word unless there’s a clear legal basis (like commentary, criticism, or review) that makes quoting them fair.”

Yet does this get it off the copyright hook?

Probably not. Compiling data sets to train AI requires the AI model to at least temporarily reproduce the works, and it is an infringement of copyright to reproduce protected works without authorisation from the copyright holder.

Therefore, regardless of whether they spout verbatim quotes from copyright works, under current Australian law, LLMs prima facie breach the human author’s right to control the use and reproduction of the copyright material. As a consequence, they need to come within one of the other exceptions under Australian law. The current fair dealing exceptions were not designed to apply to the training of AI models, and their application to such uses remains legally uncertain.

Should the Copyright Act be Amended?

The US Fair Use Doctrine

Australian law has a closed list of permitted exceptions to copyright. The US, on the other hand, has an open-ended doctrine of ‘fair use’, under which any purpose can potentially qualify if the use is “fair.” Whilst the US Copyright Act 1976 lists examples similar to the Australian Act, these are illustrative and not exhaustive.

Several reviews have recommended the adoption of the ‘fair use’ doctrine in Australia, including those by the Australian Law Reform Commission in 2013 and by the Productivity Commission. However, this has not occurred.

A New Training Exception?

The fair dealing exceptions under Australian law have not all existed since 1968, when the Copyright Act was made. The parody or satire exception, for example, was introduced in 2006, as were digital reforms permitting time-shifting and format-shifting of works.

Last month, Atlassian co-founder Scott Farquhar called for a new fair dealing exception to be created in Australia, to allow AI training.[5]

This call echoed the interim report of the Australian Government Productivity Commission’s inquiry, released in August 2025. The ‘Harnessing data and digital technology’ Interim Report[6] (PC Report) discusses AI training, noting that the training media is ‘often the subject of copyright protection, which means that their use to train AI models requires permission from the copyright holder’.

The PC Report explores the policy options available to address this, including amending the Copyright Act to include a fair dealing exception for text and data mining.

As stated in the PC Report:

[…] the PC is considering whether there is a case for a new fair dealing exception that explicitly covers text and data mining (a ‘TDM exception’). TDM exceptions exist in several comparable overseas jurisdictions (box 1.7).

Such an exception would cover not just AI model training, but all forms of analytical techniques that use machine-read material to identify patterns, trends and other useful information. For example, the use of text and data mining techniques is common in research sectors to produce large datasets that can be interrogated through statistical analysis.

The Productivity Commission is now seeking feedback on what reforms are needed to bring the copyright regime up to date, suggesting that legislative reform may be only a matter of time.

Concluding Remarks

The use of copyright materials to train AI models presents a global dilemma between two competing interests: on one hand, the interests of copyright holders and the protection of intellectual property rights, and on the other hand, the development of revolutionary technologies such as AI, which rely on reproducing copyrighted materials for their rapid, world-changing advancement.

The rapid transformation ushered in by the increasing ubiquity of genAI compels us to ask difficult questions, including whether copyright, in its current form, is fit for purpose in today’s world.

Does your business use artificial intelligence? For specific advice on how to use and commercialise AI in your business, please contact us below. Edwards + Co Legal provide corporate and commercial legal advice to modern Australian businesses.

[1] https://www.nytimes.com/interactive/2024/08/26/upshot/ai-synthetic-data.html.

[2] Copyright Act 1968 (Cth), s 36.

[3] See: https://explodingtopics.com/blog/gpt-parameters.

[4] See: IceTV Pty Ltd v Nine Network Australia Pty Ltd (2009) 239 CLR 458, where the High Court held that the test looks to whether what was taken is a reproduction of an author’s “independent intellectual effort” and “sufficient expenditure of skill and labour”; and Milwell Pty Ltd v Olympic Amusements Pty Ltd (1999) 46 IPR 445, where the Full Court of the Federal Court said the question is whether the part taken is “vital, essential, material or distinctive”.

[5] ABC’s 7.30 program, 13 August 2025.

[6] https://www.pc.gov.au/inquiries/current/data-digital/interim/data-digital-interim.pdf