Auditing Linux
Identifying all copyright holders, licenses and license obligations within a Linux distribution is one of the most complex and tedious audit activities. In Linux versions before 4.19, many source files in the tree were missing licensing information. This makes it hard for compliance tools to determine the correct license, and a manual audit is very time-consuming.
By default, all Linux Kernel files without explicit license information are licensed under GPL version 2 with the syscall exception (for brevity, we will refer to this license as GPLv2 w/SE from here on). If, however, a license is stated in the file, that license takes precedence.
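The precedence rule above can be illustrated with a minimal Python sketch. It is not the logic of any particular scanning tool: the function names, the five-line search window and the SPDX spelling chosen for the default license are assumptions made for this example.

```python
import re

# Assumption: the kernel-wide default (GPLv2 w/SE) written as an SPDX expression.
DEFAULT_LICENSE = "GPL-2.0 WITH Linux-syscall-note"

# Matches an SPDX tag anywhere on a line, e.g. in a // or /* */ comment.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def declared_license(source_text, max_lines=5):
    """Return the SPDX expression found near the top of a file, or None."""
    for line in source_text.splitlines()[:max_lines]:
        match = SPDX_TAG.search(line)
        if match:
            expression = match.group(1).strip()
            # Drop the closer of a C block comment, if the tag sits inside one.
            if expression.endswith("*/"):
                expression = expression[:-2].strip()
            return expression
    return None


def effective_license(source_text):
    """An explicit declaration takes precedence; otherwise use the default."""
    return declared_license(source_text) or DEFAULT_LICENSE
```

A file that starts with `/* SPDX-License-Identifier: MIT */` is reported as MIT, while a file without any tag falls back to the default. The sketch only reports what the header declares; it cannot tell whether that declaration is correct.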
The problem is that, in reality, not all Kernel files without an explicit license declaration are actually distributed, or intended to be distributed, under GPLv2 w/SE.
To make licensing obligations clearer to users, the Linux Foundation started a license cleanup activity in 2017. The aim was to make license compliance easier and more transparent for end users.
In a partly automated, partly manual approach, all files without license information were identified and an SPDX header was added at the beginning of each file. Some effort was made to manually review the automatically inserted headers; unfortunately, even after this manual review some files remained incorrectly licensed. Some of these files were flagged by the community at a later stage, but others remain inaccurately licensed to date.
What complicates matters is that the accuracy of the licensing information depends on the specific Linux Kernel version you are using: older versions tend to contain more errors, newer versions fewer.
As a specific example, take the file cl0002.h. In older Kernel versions this file has no license header. The project it belongs to (drm/nouveau) is almost entirely licensed under the MIT license. Nevertheless, in Linux Kernel v4.19 the file was automatically tagged with a GPL-2.0 license header. Later, in Linux Kernel v5.3, the Linux community noticed the issue and filed a bug against this file and several others that had also been incorrectly tagged with GPLv2 instead of MIT, and the license was changed to the correct (original) MIT header. The comment under this commit states: “The bulk SPDX addition made all these files into GPL-2.0 licensed files. However, the remainder of this project is MIT-licensed, these files were simply missing the boiler plate and got caught up in the global update.”
Figure 1: Example bogus GPL license header
The dilemma is this: if an auditor is analyzing Linux 4.19, which license should he or she assign to this file? Should the license in the header, GPL-2.0, be trusted? Or is deeper research needed to identify the file's real origin, which is MIT?
According to lawyers specializing in license compliance, the “license cleanup activity” was not intended to change the license of any file. The mislabeling happened because the algorithm used to automatically tag the source files in the Linux Kernel was not perfect. It follows that the license to be applied to this file is MIT and not GPL-2.0, even though the header indicates otherwise.
The practical problem is that, depending on the Linux Kernel version you are using and possibly reviewing for license compliance, several files may carry erroneous license information. When licenses are determined by automated scanning tools that rely on the SPDX header, or by auditors who are not sufficiently experienced, this false information may leak into the resulting Bill of Materials.
To identify files with wrong license headers, the license information in the Linux Kernel release under review would always have to be cross-checked against other sources of information (for example, the other files in the same submodule). For thousands of files this requires considerable effort. Alternatively, tools that detect copy-pasted code may reveal a discrepancy between the license header and the license of the code itself.
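The cross-check against neighbouring files can be partly automated. The following Python sketch is a simple heuristic under stated assumptions, not an established tool: it takes a mapping of file paths to already extracted license identifiers, determines the dominant license per directory, and flags files that deviate from it. The function name, the 75% threshold and the input format are all hypothetical.

```python
from collections import Counter, defaultdict
import posixpath


def flag_outliers(licenses_by_path, min_share=0.75):
    """Flag files whose license differs from the dominant one in their directory.

    licenses_by_path maps a file path to its declared license (or None).
    Returns a sorted list of (path, declared_license, dominant_license).
    """
    by_directory = defaultdict(dict)
    for path, declared in licenses_by_path.items():
        by_directory[posixpath.dirname(path)][path] = declared

    flagged = []
    for files in by_directory.values():
        dominant, hits = Counter(files.values()).most_common(1)[0]
        # Skip directories without a clear, explicitly declared majority.
        if dominant is None or hits / len(files) < min_share:
            continue
        for path, declared in files.items():
            if declared != dominant:
                flagged.append((path, declared, dominant))
    return sorted(flagged)
```

A flagged file is only a candidate for manual review. A directory can legitimately mix licenses, so the result indicates where to look, not which license is correct.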
An alternative way to identify some of the files that underwent automated license classification is to scan the commit messages for comments relating to license identification. For example, all files corrected after the “clean-up project” can easily be found through a distinctive comment in the commit message, and these files can then be analysed manually.
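A minimal Python sketch of such a commit-message scan follows. It assumes the commit history has been exported as one line per commit with hash and subject separated by a tab (for instance with `git log --format=%H%x09%s`). The keyword lists and the function name are illustrative assumptions and would need tuning against the real history.

```python
import re

# Assumed keyword sets: a subject must mention licensing AND a correction.
LICENSE_WORDS = re.compile(r"\b(?:spdx|licen[sc])", re.IGNORECASE)
FIX_WORDS = re.compile(r"\b(?:fix|bogus|wrong|incorrect|correct)", re.IGNORECASE)


def license_fix_commits(log_text):
    """Return (commit, subject) pairs whose subject suggests a license correction.

    log_text holds one commit per line in the form "<hash>\t<subject>".
    """
    hits = []
    for line in log_text.splitlines():
        commit, separator, subject = line.partition("\t")
        if separator and LICENSE_WORDS.search(subject) and FIX_WORDS.search(subject):
            hits.append((commit, subject))
    return hits
```

The files touched by each matching commit can then be listed with `git show --name-only <hash>` and handed over for manual analysis. Subject-line matching will miss corrections that are described only in the commit body.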
All of the above practices, and many more, are applied in the Software Composition Analysis audits carried out by Bitsea, where we do our best to deliver thorough and unambiguous results.
For further questions please write to: info@bitsea.de