Can I use copyrighted data to train an AI model without a license?

Legally risky, and increasingly indefensible. Copying a protected work into a training corpus implicates the reproduction right under 17 U.S.C. § 106, so you need either permission or a fair use defense. In Thomson Reuters v. Ross (Feb. 2025), a federal court held that training a competing tool on copyrighted material is not transformative, and the US Copyright Office's May 2025 report concluded that commercial training competing in an existing market ordinarily weighs against fair use. A license is the reliable path. Confirm current law with counsel before relying on fair use for any commercial build.

Is training AI on copyrighted data fair use?

Sometimes, but the safe assumption is no for commercial generative AI. Fair use under Section 107 turns on four factors, and recent decisions weigh them against AI training when the use is commercial, the source was pirated, licensing was reasonably available, or whole works were ingested. Courts have drawn a line: analytical text-and-data mining that derives abstract insights is more defensible, while generative training that produces expressive outputs competing with the source is disfavored. Thomson Reuters, Bartz, and Kadrey all applied this framework in 2025.

What is the difference between copyright and database rights?

Copyright protects original expression, so it covers creative content but not raw facts. The EU sui generis database right under Directive 96/9/EC protects the investment in compiling, verifying, or presenting a database, regardless of whether the contents are copyrightable, and lasts 15 years. A spreadsheet of uncopyrightable facts can still be protected by the database right in Europe. This is why a license cleared only for US copyright can still infringe European database rights, and a thorough agreement addresses both by name.

Do I need indemnification in a dataset license?

Yes. If you are the licensee, this is the clause that actually protects you. Warranties promise the data is clean; indemnification makes the licensor defend and pay if a third party sues anyway. Require coverage for IP-infringement, unlawful-collection, and lack-of-authority claims, including defense costs and settlements. Watch the cap: it is often two times annual fees, which may be far below the exposure, so for high-risk data negotiate to carve IP and privacy claims out of the cap. Anthropic's $1.5 billion settlement shows the scale these claims can reach.

What are the legal risks of training on web-scraped data?

Risks stack up fast. Copyright infringement if the scraped content is protected and no license or fair use covers it. Privacy violations if the data includes personal information collected without a lawful basis. Contract breach if a site's terms barred scraping. CNIL's June 2025 guidance accepts legitimate interest for AI training only when scraping filters out children's platforms, health sites, and personal blogs and is documented with a balancing test and DPIA. Scraping from pirate sources is the worst case, as Bartz v. Anthropic demonstrated when LibGen-sourced books surfaced in discovery.

How do I comply with the EU AI Act's training-data requirements?

If you provide a general-purpose AI model, Article 53(1)(d) requires you to publish a summary of training-data content using the AI Office template, an obligation live since August 2, 2025 with enforcement from August 2, 2026. The summary must cover copyrighted sources, licensed sources, and third-party data. Practically, you need your dataset suppliers to disclose their sources, so build that disclosure duty into the license. Non-compliance carries fines up to 15 million euros or 3% of global annual turnover, whichever is greater.

Who owns the outputs a model generates, the licensor or the licensee?

It depends entirely on what the agreement says, which is why output ownership belongs in the grant clause. The common structure gives the licensee ownership or a perpetual license to use outputs while the licensor keeps the underlying data rights. The harder question is infringement risk: outputs can be substantially similar to training inputs and infringe under 17 U.S.C. § 103, the theory behind Andersen v. Stability AI. A buyer should require the licensor to warrant that outputs will not reproduce protectable third-party expression and to indemnify if they do.

Can I fine-tune a model on licensed data, or does the license only cover training?

Only if the grant says so. Training and fine-tuning are distinct uses, and a license to train one model does not automatically permit fine-tuning successor or derivative models on the same data. List training, fine-tuning, evaluation, and inference as separate permitted uses, mark each one yes or no, and state whether derivative and successor models are covered. Ambiguity is construed against the drafter, so if you are the licensee and intend to fine-tune, get it in writing.

What is a data provenance file and why does it matter?

It is the documented record of where a dataset came from: sources by URL, database, and API, collection dates, methodology and sampling, known biases and quality metrics, and rights-clearance evidence, increasingly with C2PA metadata. It matters because it is your best contemporaneous proof of good-faith licensing if a class action subpoenas your training set, and in Europe it supplies raw material for the EU AI Act training-data summary. Bartz v. Anthropic showed what happens when discovery exposes the sources and there is no clean record to point to.

Does an open license like CC-BY or ODbL cover AI training, and is that enough?

Usually it permits training, but it is rarely enough for a serious build. CC-BY 4.0 allows commercial reuse with attribution and ODbL allows reuse under share-alike terms, so the use is generally licensed. What they do not give you are negotiated warranties about provenance or any indemnity if the data turns out to be tainted, because nobody stood behind it. For high-value or high-risk datasets, a bespoke agreement with representations and an indemnity is the difference between a download and a defensible pipeline.

Dataset Licensing Agreement: Free Template & AI Training Data License Guide

Key Takeaways

• A dataset licensing agreement grants a licensee the right to use a defined body of data for stated purposes (training, fine-tuning, evaluation, inference) while the licensor keeps ownership of the underlying copyright and database rights.
• Get the grant clause right and spell out each permitted use separately. A license to "train" does not authorize fine-tuning successor models, redistributing the raw data, or serving inference at scale, and courts will read silence against the party that drafted it.
• Warranties and indemnification carry the real money. The licensor should represent that the data does not infringe third-party IP and was collected lawfully, then back that promise with a duty to defend and pay claims, usually capped at a multiple of the license fees.
• US fair use is no longer a reliable shield. In Thomson Reuters v. Ross (Feb. 2025) a federal court held that AI training on copyrighted headnotes was not transformative, and Anthropic settled Bartz for $1.5 billion after pirate-sourced books surfaced in discovery.
• If the dataset touches EU personal data or the model is a general-purpose AI system, the agreement must layer in a GDPR data processing addendum and EU AI Act training-data disclosure obligations, with fines up to 15 million euros or 3% of global turnover for non-compliance.
• A data provenance file (sources, collection dates, rights-clearance evidence) earns its keep as the contemporaneous record that proves good-faith licensing if a class action subpoenas your training set.

Reviewed for accuracy by the document.com legal team. Educational information, not legal advice.

What Is a Dataset Licensing Agreement?

A dataset licensing agreement is a contract that lets one party use another party's structured collection of data, most often to train, fine-tune, or evaluate artificial intelligence and machine learning models, without transferring ownership of the data itself. The licensor keeps its copyrights and, in Europe, its sui generis database rights. The licensee gets a defined, limited permission to do specific things with the data and nothing more.

The agreement sits at the intersection of three bodies of law that rarely agree with each other: copyright (does copying the data into a training corpus infringe a reproduction right), data protection (was the personal data inside the dataset collected and processed lawfully), and contract (what did the parties actually promise each other). A clean license answers all three on the same page.

A bespoke license is a different instrument from a simple data purchase or an open-data download. When you grab a dataset off Kaggle or Hugging Face under CC-BY or ODbL, you accept a standardized public license with no negotiated warranties and no indemnity. A bespoke dataset licensing agreement exists precisely because the buyer wants representations about provenance and a party to stand behind them if a third party sues.

The document has become a frontline commercial instrument since 2024. Publishers, image libraries, and data brokers now license content directly to model developers, and developers demand the agreement as proof of a defensible training pipeline. After a federal court rejected the fair use defense in Thomson Reuters v. Ross and Anthropic paid $1.5 billion to settle Bartz, the license stopped being optional risk-management and started being table stakes for any serious AI build.

Why This Matters Now

The fair use safety net tore in February 2025. In Thomson Reuters Enterprise Centre v. Ross Intelligence, the District of Delaware granted summary judgment to Thomson Reuters, holding that training a competing legal-research tool on Westlaw headnotes was not transformative under 17 U.S.C. § 107 because Ross's tool served the same purpose as the original. It was the first federal ruling on AI training fair use, and it went against the AI developer.

Discovery is exposing pirate sources, and the bill is enormous. In Bartz v. Anthropic, the record showed Claude was trained in part on books pulled from LibGen and the Pirate Library Mirror. Anthropic agreed to pay $1.5 billion to a class of authors, with Judge Alsup granting preliminary approval on September 25, 2025. That is the largest known number attached to a training-data dispute, and it landed on a company that called the use research.

The biggest cases are still live, not settled. Authors Guild v. OpenAI (filed September 19, 2023, S.D.N.Y.) and The New York Times v. OpenAI and Microsoft are both in discovery as of mid-2026, with motions to dismiss denied. Andersen v. Stability AI survived the pleadings in August 2024 and is heading toward a September 2026 trial. The licensing market is pricing in this uncertainty right now.

Europe added a regulatory deadline on top of the copyright risk. Under the EU AI Act (Regulation 2024/1689), providers of general-purpose AI models must publish a summary of training data per the AI Office template, with the obligation live since August 2, 2025 and enforcement from August 2, 2026. Non-compliance carries fines up to 15 million euros or 3% of worldwide annual turnover, whichever is greater. A dataset license is now where developers source the disclosures they are required to make.

California raised the privacy stakes over the same period. The CPRA right to deletion can reach training data, and the CPPA's automated decision-making regulations, operative January 1, 2026, add pre-use notices, opt-out rights, and risk assessments for AI used in employment, financial services, housing, healthcare, and education. A license that ignores deletion obligations can force an expensive retraining later.

The Legal Backbone

17 U.S.C. § 106 and § 107: the reproduction right and fair use

Section 106(1) of the Copyright Act gives the owner the exclusive right to reproduce the work. Copying a protected work into a training corpus makes a copy, which the US Copyright Office in its May 2025 report said "clearly implicates" that right. The usual defense is fair use under Section 107, a four-factor test: the purpose and character of the use, the nature of the work, the amount used, and the effect on the market. Courts ask whether you transformed the work into something new and whether you harmed the market the author could have licensed into. After Thomson Reuters v. Ross, training a model that competes with the source in the same market is a hard sell on factors one and four, the two factors the Ross court weighed against the developer. A license sidesteps the whole test because you have permission.

17 U.S.C. § 103: derivative works and model outputs

Section 103 covers compilations and derivative works. The Copyright Office's May 2025 analysis flagged that model weights and outputs can themselves infringe where they are substantially similar to training inputs, which is the legal theory behind the style-mimicry claims in Andersen v. Stability AI. For your agreement this means the grant should address outputs explicitly: who owns them, and whether the licensor warrants that outputs generated from the licensed data will not reproduce protectable expression from third parties. Silence here is where downstream liability hides.

EU Database Directive 96/9/EC: the sui generis right

Even where the contents of a database are not copyrightable, EU law grants a separate sui generis right to whoever made a substantial investment in obtaining, verifying, or presenting the contents (Article 7(1)), lasting 15 years from completion or first publication (Article 10). Article 7 lets the right-holder prevent extraction or re-utilization of a substantial part. Article 8(1) lets a lawful user take insubstantial parts, and Article 15 forbids contracting around that floor. So a US copyright clearance does not clear EU database rights. If your data was compiled in or sourced from Europe, the license must address the database right by name, not assume copyright covers it.

EU AI Act (Regulation 2024/1689), Articles 10 and 53(1)(d)

Article 10 sets data-governance rules for high-risk systems: training, validation, and testing datasets must be relevant, sufficiently representative, and as free of errors as possible. Article 53(1)(d) requires general-purpose AI providers to publish a summary of the content used for training, following the AI Office's template, with the duty live from August 2, 2025 (and a safe harbor running to August 2, 2027 for models placed on the market before that date). Enforcement and fines begin August 2, 2026: up to 15 million euros or 3% of global annual turnover. Your dataset license should obligate the licensor to supply the source information the licensee needs to populate that mandatory summary.
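
The fine ceiling described above is simple arithmetic: the greater of a fixed amount and a share of turnover. A minimal Python sketch, assuming the 15 million euro and 3% figures quoted in this section; the function name and the example turnover figures are hypothetical, and the sketch says nothing about how a regulator would actually set a fine below the ceiling.

```python
from decimal import Decimal

# Figures as quoted in the text above; verify against the current Regulation.
FIXED_CEILING_EUR = Decimal("15000000")
TURNOVER_SHARE = Decimal("0.03")


def max_fine_eur(global_annual_turnover_eur: Decimal) -> Decimal:
    """Return the fine ceiling: the greater of the fixed amount and 3% of turnover."""
    if global_annual_turnover_eur < 0:
        raise ValueError("turnover cannot be negative")
    return max(FIXED_CEILING_EUR, TURNOVER_SHARE * global_annual_turnover_eur)


# Hypothetical provider with EUR 2 billion turnover: the 3% branch is larger.
print(max_fine_eur(Decimal("2000000000")))
# Hypothetical provider with EUR 100 million turnover: the fixed amount is larger.
print(max_fine_eur(Decimal("100000000")))
```

The crossover sits at 500 million euros of turnover; above that, the percentage branch sets the ceiling.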

GDPR Articles 6 and 17, and the CNIL June 2025 guidance

If the dataset contains personal data of EU residents, every use needs a lawful basis under Article 6. For training on public data, providers usually rely on legitimate interest under Article 6(1)(f). France's CNIL, in guidance published June 2025, accepted that legitimate interest can support AI training, but only if the data collection filters out children's platforms, health sites, and personal blogs, with a documented balancing test and a data protection impact assessment completed before training. Article 17 gives individuals a right to erasure; where deleting one person from a trained model is technically unfeasible, organizations may rely on output filtering and audit trails if they document the rationale. Note the trap CNIL itself flags: GDPR compliance does not certify copyright or contract compliance. They are separate gates.

California CPRA and the CPPA automated decision-making rules

Under the CPRA (Civil Code § 1798.100 and following), the right to deletion can extend to training data, which may force scrubbing or retraining if a consumer exercises it. The CPPA's automated decision-making technology regulations, finalized in 2025 and operative January 1, 2026, add pre-use notices, opt-out rights, and risk assessments for AI used in employment, financial services, housing, healthcare, and education. A dataset licensing agreement that ignores California deletion and opt-out mechanics leaves the licensee holding a model it may have to expensively rebuild.

What goes inside a dataset licensing agreement

Start with the grant, because it defines the entire deal and the rest of the contract just allocates risk around it. A grant clause should name each permitted use as a separate, switchable item: training, fine-tuning, evaluation and benchmarking, and inference serving. These uses are not interchangeable. A license to "train a model" does not let the licensee fine-tune successor or derivative models on the same data, redistribute the raw dataset, or stand up a retrieval system that serves the data back to users. Spell out whether the license covers one named model or its descendants, whether it is perpetual or term-limited, and whether the licensee may sublicense the resulting model to its own customers. Courts construe ambiguity against the drafter, so vagueness here costs the party that wrote it.
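
One way to picture a grant with separate, switchable uses is as a default-deny permission check. The Python sketch below is illustrative only: the `Use` and `Grant` names, the model identifiers, and the rule that anything unlisted is not granted are assumptions made for the example, not terms drawn from any particular agreement.

```python
from dataclasses import dataclass
from enum import Enum


class Use(Enum):
    TRAINING = "training"
    FINE_TUNING = "fine-tuning"
    EVALUATION = "evaluation and benchmarking"
    INFERENCE = "inference serving"
    SUBLICENSING = "sublicensing the resulting model"
    REDISTRIBUTION = "redistributing the raw dataset"


@dataclass(frozen=True)
class Grant:
    """Hypothetical model of a grant clause: anything not listed is not granted."""

    permitted_uses: frozenset[Use]
    covered_models: frozenset[str]  # the named model plus any successors expressly listed

    def allows(self, use: Use, model: str) -> bool:
        return use in self.permitted_uses and model in self.covered_models


# A license to train and evaluate one named model, and nothing else.
grant = Grant(
    permitted_uses=frozenset({Use.TRAINING, Use.EVALUATION}),
    covered_models=frozenset({"model-v1"}),
)

for use, model in [
    (Use.TRAINING, "model-v1"),
    (Use.FINE_TUNING, "model-v1"),
    (Use.TRAINING, "model-v2"),
]:
    verdict = "permitted" if grant.allows(use, model) else "not granted"
    print(f"{use.value} on {model}: {verdict}")
```

Modeling the grant as default-deny mirrors the drafting advice: fine-tuning and successor models are excluded here simply because nobody listed them.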

Separate the access from the rights. A common structure gives the licensee perpetual rights to use a model already trained on the data, but only finite access to the dataset itself. That split matters because a model, once trained, embeds the value of the data permanently, while continued access to the raw corpus lets the licensee train again. Decide which you are selling. Enterprises generally insist on perpetual model rights because retraining a model after losing data access is rarely feasible and a clawback would be commercially catastrophic.

The warranty stack is where the licensor puts its money where its mouth is. At minimum the licensor should represent that it owns or holds a valid license to the data; that training use will not infringe any third-party copyright, trademark, trade secret, or database right; that the data was collected in compliance with GDPR, CCPA, CPRA, and other applicable privacy law with no opt-out rights outstanding; and that no third-party license attached to the data prohibits AI use. Add an affirmative disclosure of known limitations: data-quality issues, sampling methodology, known biases, and gaps. A knowledge qualifier ("to licensor's knowledge") is reasonable for opt-out invocations the licensor cannot fully audit, paired with a commitment to honor opt-outs going forward using commercially reasonable efforts.

Indemnification turns the warranties into a payable promise. The licensor should agree to defend and indemnify the licensee against third-party claims that the data infringes IP rights, was collected unlawfully, or was licensed without authority. Carve out the licensee's own misuse, unauthorized modification, combination with non-licensed data, and use outside the granted scope, because a licensor cannot insure against what the licensee does on its own. Cover defense costs plus settlements and judgments, require prompt notice and reasonable cooperation, and expect a cap, frequently around two times the annual license fees. A licensee buying high-risk training data should push for a higher cap or a carve-out from the cap for IP and privacy claims, since those are the claims that actually bankrupt people.
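
The effect of the cap, and of carving a claim out of it, can be shown with a short calculation. This is a sketch under stated assumptions: the two-times multiple comes from the text above, while the function name, the fee, and the claim amount are hypothetical, and a real clause may add baskets, sub-caps, or insurance offsets that are not modeled here.

```python
from decimal import Decimal


def indemnity_payable(
    claim_amount: Decimal,
    annual_fees: Decimal,
    cap_multiple: Decimal = Decimal("2"),
    carved_out_of_cap: bool = False,
) -> Decimal:
    """Return what the licensor pays on a covered claim under a fee-multiple cap."""
    if carved_out_of_cap:
        # IP or privacy claims negotiated outside the cap are paid in full.
        return claim_amount
    cap = cap_multiple * annual_fees
    return min(claim_amount, cap)


# Hypothetical deal: 500,000 in annual fees and a 10,000,000 third-party claim.
fees = Decimal("500000")
claim = Decimal("10000000")

print("capped:    ", indemnity_payable(claim, fees))
print("carved out:", indemnity_payable(claim, fees, carved_out_of_cap=True))
print("uncovered: ", claim - indemnity_payable(claim, fees))
```

With these illustrative numbers the capped licensee absorbs most of the claim itself, which is the gap a carve-out is meant to close.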

Restrictions fence in the permitted use. Prohibit redistribution or public posting of the raw dataset and require controlled access. Add a competitive-use bar if the licensor does not want its data fueling a rival product. Decide commercial use up front: allowed, prohibited, or allowed with an attribution or licensing model for end-user outputs. Confirm that the licensor retains all copyright and database rights and that the licensee receives a limited license only. Set a data-minimization rule so the licensee does not hoard the data past need, with a deletion or destruction timeline keyed to termination.

Layer in the compliance obligations that 2025 and 2026 law now demands. Give the licensor audit rights to verify compliance. Require the licensee to report data breaches or unauthorized access within a fixed window, often 30 days, and to apply industry-standard security with destruction on termination to a recognized standard such as NIST or ISO. If the model is a general-purpose AI system, obligate the licensee to credit the licensor in its EU AI Act training-data summary, and obligate the licensor to provide the source detail that summary requires. If the dataset contains personal data, attach a GDPR-compliant data processing agreement and pin down who is controller and who is processor.

Demand a data provenance file as a contractual deliverable. It should list sources by URL, database, and API; record collection dates, methodology, and sampling strategy; document known biases and quality metrics; and attach rights-clearance evidence such as licenses and permissions for third-party content. Increasingly it carries C2PA content-authenticity metadata signaling rights status. This file is the licensee's single best evidence of good-faith compliance if a court later subpoenas the training set, which is exactly what happened in Bartz. Version it and link each version to the model iteration it trained.
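
The fields listed above map naturally onto a small structured record. The Python sketch below is one possible shape, not a standard schema: every class name, field name, and sample value is hypothetical, and C2PA metadata is left out because its format is defined by its own specification.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Source:
    kind: str             # "url", "database", or "api"
    locator: str
    collected_on: str     # ISO 8601 date
    rights_evidence: str  # reference to a license or permission document


@dataclass
class ProvenanceFile:
    dataset: str
    version: str
    model_iteration: str  # the model build this version of the data trained
    methodology: str
    sampling_strategy: str
    known_biases: list[str] = field(default_factory=list)
    quality_metrics: dict[str, float] = field(default_factory=dict)
    sources: list[Source] = field(default_factory=list)

    def missing_rights_evidence(self) -> list[str]:
        """Return locators of sources with no rights-clearance evidence attached."""
        return [s.locator for s in self.sources if not s.rights_evidence.strip()]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical example values throughout.
record = ProvenanceFile(
    dataset="news-archive",
    version="1.2.0",
    model_iteration="model-v1-run-07",
    methodology="publisher export",
    sampling_strategy="full archive, deduplicated",
    known_biases=["English-language only"],
    quality_metrics={"duplicate_rate": 0.02},
    sources=[
        Source("database", "publisher-archive", "2025-03-01", "license-2025-014.pdf"),
        Source("url", "https://example.com/feed", "2025-03-02", ""),
    ],
)

print(record.missing_rights_evidence())
print(record.to_json())
```

Flagging sources with no rights evidence before delivery is the useful check here, since those are the entries a subpoena would expose first.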

Close with termination and survival. State the term and renewal conditions, allow termination for convenience on notice, and allow termination for breach after a cure period. Address the hard question directly: after termination, the licensee stops new training, but may it keep models already trained on the data, or must it destroy them? Perpetual output rights paired with a stop on new training is the common enterprise outcome. Make warranties and indemnification survive termination, typically for two to three years, because the claim that triggers the indemnity often arrives long after the deal ends.
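
Whether a late-arriving claim still falls inside the survival window is a date calculation. A minimal Python sketch, assuming survival is measured in whole years from the termination date; the function names and dates are hypothetical, and the actual trigger (claim made, notice given, or suit filed) depends on how the clause is drafted.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return d.replace(year=d.year + years, day=28)


def within_survival(claim_date: date, termination_date: date, survival_years: int) -> bool:
    """Return True if the claim arrives on or before the end of the survival period."""
    return claim_date <= add_years(termination_date, survival_years)


# Hypothetical agreement terminated on 30 June 2026 with three-year survival.
terminated = date(2026, 6, 30)

print(add_years(terminated, 3))
print(within_survival(date(2028, 1, 15), terminated, 3))
print(within_survival(date(2030, 1, 1), terminated, 3))
```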

When You Need This

You are an AI or machine learning developer sourcing third-party text, images, audio, video, or structured records to train, fine-tune, or evaluate a model, and you need contractual proof the data is clean.

You own or compiled a valuable dataset (a publisher's archive, an image library, proprietary sensor or transaction data, an annotated corpus) and want to monetize it for AI training while keeping your copyrights and database rights.

You are placing or fine-tuning a general-purpose AI model on the EU market and must be able to publish the Article 53(1)(d) training-data summary, which requires source detail your data suppliers have to provide.

Your dataset includes personal data of EU residents or California consumers, so you need a license that carries a GDPR data processing addendum and addresses CPRA deletion and CPPA automated-decision obligations.

You are licensing data for a use that competes, or could be argued to compete, with the source's own market, where fair use is weakest after Thomson Reuters v. Ross and a negotiated license is the only safe path.

You are an enterprise buyer who needs indemnification and a paper trail (provenance file, rights-clearance certificate) before your legal team will approve a model trained on outside data.

How to Fill Out a Dataset Licensing Agreement

1. Identify the data and verify the chain of rights
Define the dataset precisely: contents, format, volume, and the exact sources. Before drafting a grant, the licensor must confirm it actually holds the rights it is about to license. Check copyright ownership, any upstream licenses attached to third-party content, and, for European data, the separate sui generis database right under Directive 96/9/EC. If personal data is present, document the lawful basis under GDPR Article 6 and whether a CNIL-style balancing test and DPIA were done. Write down what you find. This becomes the provenance file.
2. Draft the grant clause with each use as a separate switch
List the permitted uses one by one: training, fine-tuning, evaluation, inference serving, and (if intended) sublicensing the resulting model. Mark each yes or no. State whether the license covers a single named model or successor and derivative models, whether dataset access is perpetual or term-limited, and whether model rights survive the dataset access ending. Do not let "use the data" stand in for a specific list, because the gaps will be read against whoever drafted them.
3. Build the warranty stack
Have the licensor represent ownership or valid license, non-infringement of copyright, trademark, trade secret, and database rights, privacy-law compliance with no outstanding opt-outs, and the absence of any third-party license barring AI use. Require an affirmative disclosure of known biases, quality issues, sampling method, and gaps. Accept a "to licensor's knowledge" qualifier only where the licensor genuinely cannot audit the fact, and pair it with a forward commitment to honor opt-outs.
4. Negotiate indemnification and the cap
Write the licensor's duty to defend and indemnify against third-party IP, privacy, and authority claims. Carve out licensee misuse, unauthorized modification, out-of-scope use, and combination with non-licensed data. Specify defense costs plus settlement and judgment coverage, prompt notice, and cooperation. Set the cap, commonly around two times annual fees, and for high-risk data negotiate to lift IP and privacy claims above the cap. This clause is the one that pays out, so do not rush it.
5. Set restrictions, retained rights, and data minimization
Prohibit redistribution and public posting of the raw data and require controlled access. Add any competitive-use bar. State that the licensor retains all copyright and database rights and the licensee gets a limited license only. Decide commercial use and any output attribution model. Add a data-minimization rule with a deletion or destruction deadline tied to termination, so the licensee does not retain the corpus past need.
6. Bolt on the compliance layer
Add audit rights, a breach-notification window (often 30 days), and an industry-standard security obligation with NIST- or ISO-grade destruction on termination. If the model is a general-purpose AI system, require the licensee to credit the licensor in its EU AI Act training-data summary and require the licensor to provide the source data that summary needs. If personal data is in scope, attach a GDPR data processing agreement and assign controller and processor roles.
7. Require the provenance file as a deliverable
Make delivery of a data provenance file a contractual condition. It should list every source, collection dates, methodology and sampling, known limitations and quality metrics, and rights-clearance evidence, ideally with C2PA metadata. Tie each provenance version to the model iteration it trained. This is your litigation insurance and, in Europe, partial raw material for the AI Act disclosure.
8. Finalize term, termination, survival, and signatures
Set the term, renewal terms, termination for convenience on notice, and termination for breach after a cure period. Decide explicitly whether already-trained models survive termination or must be destroyed. Make warranties and indemnification survive for two to three years past the end. Confirm governing law, then have authorized representatives of both entities sign and date. Keep the signed agreement, provenance file, and any rights-clearance certificate together as one record.

Key Terms Defined

Sui generis database right: A standalone EU right under Directive 96/9/EC protecting whoever made a substantial investment in obtaining, verifying, or presenting a database's contents, lasting 15 years and existing independently of copyright. It can block extraction of substantial parts even where the underlying facts are not copyrightable, which is why a US copyright clearance does not cover European data.
Fair use (17 U.S.C. § 107): The US defense that allows limited unlicensed use of copyrighted work, judged on four factors: purpose, nature of the work, amount used, and market effect. For AI training its strength fell sharply after Thomson Reuters v. Ross held that training a competing tool on copyrighted material is not transformative when it serves the same market purpose.
General-purpose AI (GPAI) provider: Under the EU AI Act, a provider of a model that can perform a wide range of tasks. These providers must publish a summary of training-data content per the AI Office template, an obligation that flows back to dataset licensors expected to disclose their sources.
Data provenance file: A documented record of a dataset's sources, collection dates, methodology, known limitations, quality metrics, and rights-clearance evidence, often carrying C2PA metadata. It functions as contemporaneous proof of good-faith licensing if a court later subpoenas the training set.
Indemnification cap: A contractual ceiling on how much a party must pay under its indemnity, frequently set at a multiple of the license fees such as two times annual fees. Sophisticated buyers of high-risk training data negotiate to lift IP and privacy claims above the cap because those are the claims capable of exceeding it many times over.
Open data license (CC-BY, ODbL, OpenRAIL, CDLA, C-UDA): Standardized public licenses for datasets. CC-BY 4.0 allows commercial reuse with attribution; ODbL is share-alike and mirrors database-right restrictions; OpenRAIL adds ethical use-case limits; CDLA-Permissive is MIT-like for data; C-UDA permits internal R&D but bars redistribution. None carries negotiated warranties or indemnity, which is why high-value deals use a bespoke agreement instead.

Legal Authorities & Sources

This page is grounded in primary law. The statutes and official resources below are the authorities behind the guidance above. Verify the current text of any statute before relying on it.

Frequently Asked Questions
