licence-normaliser
******************

Robust licence normalisation with a three-level hierarchy for common
licences.

"licence-normaliser" maps common licence representations (SPDX tokens,
URLs, prose descriptions) to a canonical three-level hierarchy.


Features
========

* **Three-level hierarchy** - LicenceFamily → LicenceName →
  LicenceVersion.

* **Wide format support** - SPDX tokens, URLs, and prose descriptions
  for supported licences.

* **Creative Commons support** - Full CC family with versions and IGO
  variants.

* **Publisher-specific licences** - Springer, Nature, Elsevier, Wiley,
  ACS, and more.

* **File-driven data** - Add aliases, URLs, and patterns by editing
  JSON files. No Python code changes required for new synonyms.

* **Pluggable parsers** - Drop in a new parser class to ingest any
  external licence registry. Parsers implement plugin interfaces
  ("RegistryPlugin", "URLPlugin", etc.).

* **Strict mode** - Raise "LicenceNotFoundError" instead of silently
  returning ""unknown"".

* **Caching** - LRU caching for performance.

* **CLI** - Command-line interface with "--strict" and "--trace"
  support.


Hierarchy
=========

The library uses a three-level hierarchy:

1. **LicenceFamily** - broad bucket: ""cc"", ""osi"", ""copyleft"",
   ""publisher-tdm"", ...

2. **LicenceName** - version-free: ""cc-by"", ""cc-by-nc-nd"",
   ""mit"", ""wiley-tdm""

3. **LicenceVersion** - fully resolved: ""cc-by-3.0"",
   ""cc-by-nc-nd-4.0""

"LicenceVersion" also has optional "jurisdiction" (e.g., ""uk"",
""au"") and "scope" (e.g., ""igo"") fields for CC licences.

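The three levels described above can be pictured as nested records, each pointing at its parent. The following is only a minimal sketch: the dataclasses "FamilySketch", "NameSketch", and "VersionSketch" are hypothetical stand-ins invented for illustration and are not the library's actual classes. Only the level names, example keys, and the optional "jurisdiction" and "scope" fields come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FamilySketch:
    """Hypothetical top level: a broad bucket such as "cc"."""

    key: str


@dataclass(frozen=True)
class NameSketch:
    """Hypothetical middle level: a version-free licence name."""

    key: str
    family: FamilySketch


@dataclass(frozen=True)
class VersionSketch:
    """Hypothetical bottom level: a fully resolved licence version."""

    key: str
    licence: NameSketch
    jurisdiction: Optional[str] = None  # e.g. "uk", "au"
    scope: Optional[str] = None  # e.g. "igo"

    def __str__(self) -> str:
        return self.key


# Build one chain by hand: family -> name -> version.
cc = FamilySketch("cc")
by_nc_nd = NameSketch("cc-by-nc-nd", cc)
v = VersionSketch("cc-by-nc-nd-4.0", by_nc_nd)

# Walk from the most specific level up to the broadest one.
print(v, v.licence.key, v.licence.family.key)
```

Keeping a reference to the parent on each level means a resolved version is enough to recover the name and family without a second lookup.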

Installation
============

With "uv":

   uv pip install licence-normaliser

Or with "pip":

   pip install licence-normaliser


Quick start
===========

   from licence_normaliser import normalise_licence

   v = normalise_licence("CC BY-NC-ND 4.0")
   assert str(v) == "cc-by-nc-nd-4.0"        # ← LicenceVersion
   assert str(v.licence) == "cc-by-nc-nd"    # ← LicenceName
   assert str(v.licence.family) == "cc"      # ← LicenceFamily

   # With jurisdiction and scope
   v = normalise_licence("http://creativecommons.org/licenses/by-nc/2.0/uk")
   assert v.jurisdiction == "uk"
   assert v.scope is None

   v = normalise_licence("http://creativecommons.org/licenses/by-nc/3.0/igo")
   assert v.jurisdiction is None
   assert v.scope == "igo"


Strict mode
===========

By default, unresolvable inputs return an ""unknown"" result. Pass
"strict=True" to raise "LicenceNotFoundError" instead:

   from licence_normaliser import normalise_licence
   from licence_normaliser.exceptions import LicenceNotFoundError

   # Silent fallback (default)
   v = normalise_licence("some-unknown-string")
   assert v.family.key == "unknown"

   # Strict: raises on unresolvable input
   try:
       v = normalise_licence("some-unknown-string", strict=True)
   except LicenceNotFoundError as exc:
       print(exc.raw)      # original input
       print(exc.cleaned)  # cleaned form that failed lookup


Trace / Explain
===============

Set "ENABLE_LICENCE_NORMALISER_TRACE=1" or pass "trace=True" to get
resolution traces showing how the licence was matched:

   from licence_normaliser import normalise_licence

   # Via function
   v = normalise_licence("cc by-nc-nd 3.0 igo", trace=True)
   print(v.explain())

   # Via class
   from licence_normaliser import LicenceNormaliser

   ln = LicenceNormaliser(trace=True)
   v = ln.normalise_licence("MIT")
   print(v.explain())

Output shows the resolution pipeline (alias → registry → url → prose →
fallback) and which source file + line matched:

   Input: 'cc by-nc-nd 3.0 igo' → 'cc by-nc-nd 3.0 igo'
     [✓] alias: 'cc by-nc-nd 3.0 igo' → 'cc-by-nc-nd-3.0-igo' (line 139 in aliases.json)
   Result:
     version_key: 'cc-by-nc-nd-3.0-igo'
     name_key: 'cc-by-nc-nd'
     family_key: 'cc'

The trace can also be accessed via "v._trace" for programmatic use.


Batch normalisation
===================

   from licence_normaliser import normalise_licences

   results = normalise_licences(["MIT", "Apache-2.0", "CC BY 4.0"])
   for r in results:
       print(r.key)

   # Strict batch - raises on first unresolvable
   results = normalise_licences(["MIT", "Apache-2.0"], strict=True)


Custom plugins
==============

The "LicenceNormaliser" class lets you inject custom plugin classes
for specialised use cases:

   from licence_normaliser import LicenceNormaliser
   from licence_normaliser.parsers.alias import AliasParser
   from licence_normaliser.parsers.spdx import SPDXParser

   # Use only SPDX + Alias plugins (no CC, no publisher URLs)
   ln = LicenceNormaliser(
       registry=[SPDXParser],
       alias=[AliasParser],
       family=[AliasParser],
       name=[AliasParser],
       cache=True,
       cache_maxsize=8192,
   )

   # MIT resolves via SPDX parser
   assert str(ln.normalise_licence("MIT")) == "mit"

   # CC BY resolves via Alias
   assert str(ln.normalise_licence("CC BY-NC-ND 4.0")) == "cc-by-nc-nd-4.0"

Note:

  "LicenceNormaliser()" automatically loads the full default set of
  parsers. To use a reduced set you must explicitly pass *all* six
  plugin lists (registry, url, alias, family, name, prose).

For caching, "LicenceNormaliser" wraps the resolution method with
"lru_cache". Disable it by passing "cache=False" for debugging:

   from licence_normaliser import LicenceNormaliser

   ln = LicenceNormaliser(cache=False)
   result = ln.normalise_licence("MIT")


Update data (CLI)
=================

   licence-normaliser update-data --force
   # Fetches fresh SPDX, OpenDefinition, OSI, CreativeCommons, and ScanCode JSONs


Integration tests (public API only)
===================================

All integration tests live in
"src/licence_normaliser/tests/test_integration.py" and only import the
public API.

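The caching behaviour described under Custom plugins (an "lru_cache" wrapper around the resolution method, switched off with "cache=False") can be sketched as follows. This is an illustration under assumptions, not the library's code: the class "CachedResolverSketch", its "_resolve" method, the "calls" counter, and the toy alias table are all hypothetical. Only "functools.lru_cache" is a real standard-library call.

```python
from functools import lru_cache


class CachedResolverSketch:
    """Hypothetical resolver that optionally memoises lookups per instance."""

    # Toy lookup table standing in for real licence data (assumption).
    _ALIASES = {
        "mit": "mit",
        "apache-2.0": "apache-2.0",
        "cc by 4.0": "cc-by-4.0",
    }

    def __init__(self, cache: bool = True, cache_maxsize: int = 8192) -> None:
        self.calls = 0  # counts uncached resolutions, for demonstration only
        if cache:
            # Wrap the bound method so each instance owns its own cache.
            self.resolve = lru_cache(maxsize=cache_maxsize)(self._resolve)
        else:
            self.resolve = self._resolve

    def _resolve(self, raw: str) -> str:
        self.calls += 1
        return self._ALIASES.get(raw.strip().lower(), "unknown")


cached = CachedResolverSketch(cache=True)
cached.resolve("MIT")
cached.resolve("MIT")  # second call is served from the cache

uncached = CachedResolverSketch(cache=False)
uncached.resolve("MIT")
uncached.resolve("MIT")  # resolved again because caching is off

print(cached.calls, uncached.calls)  # prints: 1 2
```

Wrapping the bound method in the constructor, rather than decorating the method on the class, keeps each instance's cache separate and makes the "cache=False" path a plain, uncached call that is easier to step through when debugging.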

CLI usage
=========

Normalise a single licence:

   licence-normaliser normalise "MIT"
   # Output: mit

   licence-normaliser normalise --full "CC BY 4.0"
   # Output:
   #   Key: cc-by-4.0
   #   URL: https://creativecommons.org/licenses/by/4.0/
   #   Licence: cc-by
   #   Family: cc

   licence-normaliser normalise --strict "totally-unknown"
   # Exits with code 1 and prints an error

Batch normalise:

   licence-normaliser batch MIT "Apache-2.0" "CC BY 4.0"
   licence-normaliser batch --strict MIT "Apache-2.0"


Exceptions
==========

   from licence_normaliser.exceptions import (
       DataSourceError,            # data source loading errors
       LicenceNormaliserError,     # base class
       LicenceNotFoundError,       # raised by strict mode
       LicenceNormalisationError,  # kept for backwards compatibility
   )
   from licence_normaliser import (
       LicenceTrace,       # resolution trace object
       LicenceTraceStage,  # resolution stage enum
   )


Testing
=======

All tests run inside Docker:

   make test

To test a specific Python version:

   make test-env ENV=py312


Licence
=======

MIT


Author
======

Artur Barseghyan


Project documentation
=====================

Contents:


Table of Contents
^^^^^^^^^^^^^^^^^

* licence-normaliser
* Features
* Hierarchy
* Installation
* Quick start
* Strict mode
* Trace / Explain
* Batch normalisation
* Custom plugins
* Update data (CLI)
* Integration tests (public API only)
* CLI usage
* Exceptions
* Testing
* Licence
* Author
* Project documentation
* Contributor guidelines
* Developer prerequisites
* Code standards
* Virtual environment
* Installation
* Testing
* Adding new normalisation rules
* Releases
* Adding tests
* Pull requests
* Questions
* Issues
* Security Policy
* Reporting a Vulnerability
* Supported Versions
* Release history and notes
* 0.6.1
* 0.6
* 0.5.2
* 0.5.1
* 0.5
* 0.4
* 0.3.2
* 0.3.1
* 0.3
* 0.2
* 0.1.1
* 0.1
* Package
* Indices and tables
* Project source-tree
* README.rst
* CONTRIBUTING.rst
* AGENTS.md
* conftest.py
* docker-compose.yml
* pyproject.toml
* scripts/README.rst
* scripts/__init__.py
* scripts/apply_aliases_patch.py
* scripts/check_missing_aliases.py
* scripts/compare_datasets.py
* scripts/compare_scancode_categories.py
* scripts/find_alias_duplicates.py
* scripts/migrate_publishers_to_aliases.py
* scripts/migrate_url_map_to_aliases.py
* scripts/sort_aliases.py
* scripts/test_name_inference.py
* src/licence_normaliser/__init__.py
* src/licence_normaliser/_cache.py
* src/licence_normaliser/_core.py
* src/licence_normaliser/_models.py
* src/licence_normaliser/_normaliser.py
* src/licence_normaliser/_trace.py
* src/licence_normaliser/cli/__init__.py
* src/licence_normaliser/cli/_main.py
* src/licence_normaliser/data/README.rst
* src/licence_normaliser/data/aliases/aliases.json
* src/licence_normaliser/data/creativecommons/creativecommons.json
* src/licence_normaliser/data/opendefinition/opendefinition.json
* src/licence_normaliser/data/osi/osi.json
* src/licence_normaliser/data/prose/prose_patterns.json
* src/licence_normaliser/data/scancode_licensedb/scancode_licensedb.json
* src/licence_normaliser/data/spdx/spdx.json
* src/licence_normaliser/defaults.py
* src/licence_normaliser/exceptions.py
* src/licence_normaliser/parsers/__init__.py
* src/licence_normaliser/parsers/alias.py
* src/licence_normaliser/parsers/creativecommons.py
* src/licence_normaliser/parsers/opendefinition.py
* src/licence_normaliser/parsers/osi.py
* src/licence_normaliser/parsers/prose.py
* src/licence_normaliser/parsers/scancode_licensedb.py
* src/licence_normaliser/parsers/spdx.py
* src/licence_normaliser/plugins.py
* src/licence_normaliser/tests/__init__.py
* src/licence_normaliser/tests/conftest.py
* src/licence_normaliser/tests/test_alias_expansion.py
* src/licence_normaliser/tests/test_aliases.py
* src/licence_normaliser/tests/test_cache.py
* src/licence_normaliser/tests/test_cli.py
* src/licence_normaliser/tests/test_core.py
* src/licence_normaliser/tests/test_exceptions.py
* src/licence_normaliser/tests/test_integration.py
* src/licence_normaliser/tests/test_models.py
* src/licence_normaliser/tests/test_prose.py
* src/licence_normaliser/tests/test_trace.py