Bisquat2: What is hiding there?

Today, we are shedding light on a topic that is still all too readily dismissed as the “little sister of programming”. What hardly anyone cared about 20 years ago will soon be subject to government regulation!

As we now know, a major focus of Bitsea is checking software for hidden risks. Most people think first of cybersecurity, or more generally of outdated or poorly maintained components, as we saw with Log4j. We have already heard that Bisquat2 is also used to analyze design pattern violations. Hardly anyone, however, immediately thinks of open source and licenses.

In the meantime, open source analysis has become the most important pillar of Bitsea. We support our customers in identifying and documenting every component of their software. The result is a “Software Bill of Materials” (SBOM), which lists all components together with their licenses, copyrights and vulnerabilities so that the software can be checked for risks. Under the new Cyber Resilience Act (CRA), companies that distribute their software commercially are obliged to carry out a risk assessment; under the Digital Operational Resilience Act (DORA), the same applies to companies operating in the financial sector. The basis for this assessment is the SBOM. You can find more detailed information about both regulations here for DORA and here for CRA. The SBOM is an important tool for dealing with problems and vulnerabilities in software, that is, for documenting them and communicating them to others.

Licenses are tricky, as anyone who has worked in IT for critical infrastructure knows. Many licenses attach conditions to the commercial use of the associated libraries. This may mean that source code using components under these licenses must be made publicly accessible. It is understandable that many large corporations, and especially medium-sized companies, do not want to make their software available to the general public. Just imagine this scenario with a large energy supplier… What’s more, you want to be able to sit in your electric car after the latest software update knowing that its software will remain a company secret for years to come, without outside interference.

But now that even well-known manufacturers can no longer do without open source components in their software, as developing everything from scratch is expensive and time-consuming, companies such as Bitsea are being commissioned to find out what is hidden there.

Analyses show that an average software product contains around 80 percent open source code. The image of an iceberg, with most of its mass hidden under water, fits perfectly. Many open source projects declare their licenses on GitHub, but a deep license analysis quickly shows that there is often more going on than meets the eye. In contrast to commercial software, where you (often) know exactly what it contains, open source software is often a hodgepodge that is difficult to untangle.

With Bisquat2, we primarily developed a tool for analyzing the quality of source code. However, since we deal with licenses every day in our work, we wanted to integrate tooling that supports us in this as well.

We already have very good tools at our disposal for license analysis, such as Flexera SCA, Black Duck, FOSSology, ScanCode and ORT. We use these tools to audit our customers’ source code quickly and efficiently using a multi-factor approach: every single file is checked, and even problematic code fragments are detected where necessary. Binary and media files can be analyzed as well. As a result, we create an SBOM that lists the copyrights, license names, license texts, vulnerabilities and other parameters of the components.

Bisquat2 is now able to read this SBOM and check certain aspects of the components in more detail. By extracting data from the reports, Bisquat2 lets us check, for example, whether the license texts match the original SPDX license templates, using regular expressions. This function is very important: many authors modify their license texts, some so heavily that the change is immediately noticeable, others only minimally. The minimal changes in particular make it difficult to distinguish these personalized license texts from the original. If there is even the slightest change, the license text must be declared accordingly in order to avoid legal consequences. Since a license text often consists of different kinds of passages, this comparison is unfortunately not easy to implement. In addition to passages that must not be changed, there are optional passages that can be omitted, but if they are included in the text, they must match the template exactly. There are also variable passages in which individually adapted text may appear.
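To make the three kinds of passages more tangible, here is a minimal sketch in Python. It is not how Bisquat2 implements the check, and it does not parse real SPDX template markup. The segment list, the toy template and the function names `build_pattern` and `matches_template` are assumptions made up for this illustration.

```python
import re


def _literal(text):
    # Escape the literal words and allow any run of whitespace between them,
    # so that line wrapping alone does not count as a modification.
    return r"\s+".join(re.escape(word) for word in text.split())


def build_pattern(segments):
    """Turn (kind, value) segments into one compiled regular expression.

    kind is "required", "optional" or "variable" (hypothetical names).
    For "variable", value is itself a regular expression.
    """
    parts = []
    for kind, value in segments:
        if kind == "required":
            parts.append(_literal(value))
        elif kind == "optional":
            # May be missing, but if present it must match exactly.
            parts.append("(?:" + _literal(value) + ")?")
        elif kind == "variable":
            parts.append("(?:" + value + ")")
        else:
            raise ValueError("unknown segment kind: " + kind)
    body = r"\s*".join(parts)
    return re.compile(r"\s*" + body + r"\s*", re.IGNORECASE | re.DOTALL)


def matches_template(text, segments):
    # fullmatch: the whole license text must be explained by the template.
    return build_pattern(segments).fullmatch(text) is not None


# Toy template, invented for this example (not a real SPDX template).
TOY_TEMPLATE = [
    ("optional", "Example License"),
    ("required", "Copyright (c)"),
    ("variable", r".{1,200}?"),
    ("required", "Permission is granted to use this software."),
]

original = "Copyright (c) 2024 Jane Doe\nPermission is granted to use this software."
modified = "Copyright (c) 2024 Jane Doe\nPermission is granted to use this program."

print(matches_template(original, TOY_TEMPLATE))
print(matches_template(modified, TOY_TEMPLATE))
```

The first text fits the toy template even though the optional title is missing, while the second differs by a single word in a required passage and is rejected. A text that does not match would then have to be declared as a personalized license.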

Another handy feature is our similarity search for licenses in Bisquat2. Many licenses look very similar at first glance. In our search, we enter a license text and get a list of the license suggestions with the highest similarity. As we now have a very large database of licenses, including personalized license texts, we can assign licenses more quickly. To outsiders, this function may not seem important at first sight, but it makes everyday work much easier, as a tool with such a function did not previously exist.
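The idea behind such a search can be sketched in a few lines of Python. We do not describe the actual similarity measure used in Bisquat2 here; this example simply assumes a word-level comparison with `difflib.SequenceMatcher` from the standard library. The function names and the two toy license snippets are invented for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and keep only alphanumeric words, so punctuation and
    # line breaks do not influence the score.
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_licenses(query, known_licenses, top_n=3):
    """Return up to top_n (name, score) pairs, most similar first."""
    query_words = normalize(query)
    scored = []
    for name, text in known_licenses.items():
        matcher = SequenceMatcher(None, query_words, normalize(text), autojunk=False)
        scored.append((matcher.ratio(), name))
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, round(score, 3)) for score, name in scored[:top_n]]


# Toy database, invented for this example.
TOY_LICENSES = {
    "Toy-A": "Permission is granted to use this software free of charge.",
    "Toy-B": "Redistribution in source and binary forms is allowed with attribution.",
}

for name, score in rank_licenses("permission is granted to use this software, free of charge", TOY_LICENSES):
    print(name, score)
```

With a real license database, the texts would be much longer and a faster pre-filter would probably be needed, but the principle of returning a ranked list of candidates stays the same.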

A further step is to retrieve and process metadata from repositories. To do this, we need data that indicates the operational risk and sustainability of a project. For example, the number of authors, the number of commits or pull requests and their timestamps show how actively a project is maintained or improved. If these numbers are low, this suggests an increased risk of unpatched vulnerabilities, because fixes may no longer be made available on a regular basis.
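As a rough illustration of how such metadata could be condensed into a risk indicator, consider the following Python sketch. The input format, the function name `activity_summary` and the thresholds (one year without commits, fewer than two authors) are assumptions chosen for this example, not values used by Bisquat2.

```python
from datetime import datetime, timezone


def activity_summary(commits, now, stale_after_days=365, min_authors=2):
    """Summarize commit metadata given as (author, ISO 8601 timestamp) pairs.

    The thresholds are hypothetical defaults for illustration only.
    """
    authors = {author for author, _ in commits}
    timestamps = [datetime.fromisoformat(stamp) for _, stamp in commits]
    last_commit = max(timestamps) if timestamps else None
    days_since_last = (now - last_commit).days if last_commit else None
    at_risk = (
        last_commit is None
        or days_since_last > stale_after_days
        or len(authors) < min_authors
    )
    return {
        "authors": len(authors),
        "commits": len(commits),
        "days_since_last_commit": days_since_last,
        "at_risk": at_risk,
    }


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
active = [
    ("alice", "2024-05-01T10:00:00+00:00"),
    ("bob", "2024-05-20T09:30:00+00:00"),
]
stale = [("carol", "2021-03-01T00:00:00+00:00")]

print(activity_summary(active, now))
print(activity_summary(stale, now))
```

In practice, the commit list would come from the repository host or from the local history, and pull requests could be evaluated in the same way.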

Finally, I would like to elaborate on the relationships between licenses. This is probably the most important and toughest part of the whole license story. As we already know, licenses can differ considerably. Based on their obligations, we divide them into four categories and describe how each is handled. Category one covers “strong copyleft” and “weak copyleft” licenses, which under specific conditions require users of their projects to republish all (or parts) of the associated code, so that it is available to the public again. In our second category, we classify commercial or unknown licenses. Here it must first be checked whether a commercial license has to be acquired or, in the case of an unknown license, whether its clauses fit the intended usage scenario. The third category covers “permissive” licenses, which impose only minimal restrictions. The code does not have to be published again; usually only the copyright notice has to be retained in order to give the author due credit. As a result, software that works with these open source components can be used in a “protected” manner. Our fourth category deals with dual licensing, meaning that developers can choose, from two or more licenses, the one that is most suitable for their own project.
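The four categories can be expressed as a small lookup in Python. This is only a sketch: the license sets below contain a handful of well-known SPDX identifiers for illustration and are not Bitsea's actual classification. Treating any expression that contains `OR` as dual licensing is a simplifying assumption.

```python
from enum import Enum


class Category(Enum):
    COPYLEFT = 1               # strong and weak copyleft
    COMMERCIAL_OR_UNKNOWN = 2  # needs a manual check
    PERMISSIVE = 3
    DUAL = 4                   # a choice between several licenses


# Small illustrative sets, far from complete.
COPYLEFT = {"GPL-2.0-only", "GPL-3.0-only", "LGPL-2.1-only", "MPL-2.0"}
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def categorize(expression):
    """Map an SPDX-style license expression to one of the four categories."""
    expression = expression.strip()
    if " OR " in expression:
        return Category.DUAL
    if expression in COPYLEFT:
        return Category.COPYLEFT
    if expression in PERMISSIVE:
        return Category.PERMISSIVE
    # Anything we cannot identify is treated like a commercial license:
    # it has to be reviewed before the component is used.
    return Category.COMMERCIAL_OR_UNKNOWN


for expr in ["GPL-3.0-only", "MIT", "MIT OR GPL-2.0-only", "Some-Vendor-EULA"]:
    print(expr, "->", categorize(expr).name)
```

Defaulting to the second category is a deliberately cautious choice: an unrecognized license is reviewed by a person instead of being silently treated as harmless.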

Compatibility is another point that needs to be mentioned. Not all licenses are compatible with each other, because the conditions of some licenses rule out the conditions of others. For example, the copyleft license GPL-2.0-only is not compatible with the permissive license Apache-1.0. Legal advice in this regard is therefore essential.

In addition, the linking between components plays a major role in how they can or may be handled. Depending on how many components a software product contains, keeping an overview of their specific obligations can get quite complex.

The resulting report is still difficult for most people to keep track of in its tabular form, so we generate a 3D visualization from it in Bisquat2. This visualization shows a FOSS graph in 3D and at the same time breaks down the open source components, giving a more understandable insight into this difficult legal subject. You can see chains of dependencies, licenses that carry special obligations, and security risks.
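Behind such a graph is a simple data structure: components as nodes and dependencies as edges. The following Python sketch shows how chains of dependencies leading to components with special obligations could be extracted from it. The component names, the `flagged` set and the function `chains_to_flagged` are hypothetical; the visualization in Bisquat2 is not based on this code.

```python
def chains_to_flagged(graph, root, flagged):
    """Return every dependency path from root that ends at a flagged component.

    graph maps a component name to the list of components it depends on.
    """
    chains = []

    def walk(node, path):
        path = path + [node]
        if node in flagged and node != root:
            chains.append(path)
        for dependency in graph.get(node, []):
            if dependency not in path:  # guard against dependency cycles
                walk(dependency, path)

    walk(root, [])
    return chains


# Toy dependency graph, invented for this example.
graph = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": [],
    "lib-c": [],
}
flagged = {"lib-c"}  # e.g. components with copyleft obligations

for chain in chains_to_flagged(graph, "app", flagged):
    print(" -> ".join(chain))
```

Each chain that is printed corresponds to one path you could follow in the visualization, from the product itself down to the component that needs attention.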

I will describe exactly how we implemented this with our partners at H-BRS in the next blog post. We will also take a closer look at another visualization that one of our interns is currently implementing as part of his bachelor’s thesis.