<p>We took all 166 699 packages from <a href="https://rubygems.org/">RubyGems.org</a> and
rebuilt them in <a href="https://copr.fedorainfracloud.org/">Copr</a>. Let’s explore the results together.</p>
<h2 id="success-rate">Success rate</h2>

<p>Of the 166 699 Gems hosted on <a href="https://rubygems.org/">RubyGems.org</a>,
98 816 were successfully built in <a href="https://copr.fedorainfracloud.org/">Copr</a> for Fedora
Rawhide. That is a 59.3% success rate. For the rest, it
is important to distinguish in which build phase they failed. Out of
67 883 failures, 62 717 happened while converting
their <a href="https://bundler.io/gemfile.html">Gemfile</a> into a <a href="https://rpm-packaging-guide.github.io/#what-is-a-spec-file">spec</a> and only 5 166 while
building the actual RPM packages. In other words, 103 982 Gems reached
the RPM build phase and 98 816 of them succeeded, so if a Gem can be
properly converted to a <a href="https://rpm-packaging-guide.github.io/#what-is-a-spec-file">spec</a> file, there is a 95% probability
that it will be successfully built into an RPM.</p>
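<p>As a quick sanity check, the percentages above can be reproduced from the raw
counts. This is only a sketch of the arithmetic; the variable names are mine,
the numbers are the ones quoted in the text.</p>

```python
# Figures quoted in the text.
TOTAL_GEMS = 166_699
BUILT_OK = 98_816
FAILED_SPEC_CONVERSION = 62_717  # failed before an RPM build was attempted
FAILED_RPM_BUILD = 5_166


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


failures = TOTAL_GEMS - BUILT_OK
# Gems that got past the spec conversion and reached the RPM build phase.
reached_rpm_build = TOTAL_GEMS - FAILED_SPEC_CONVERSION

overall_success = percent(BUILT_OK, TOTAL_GEMS)
rpm_phase_success = percent(BUILT_OK, reached_rpm_build)

print(f"failures: {failures}")
print(f"overall success rate: {overall_success}%")
print(f"success rate once a spec exists: {rpm_phase_success}%")
```

<p>The second percentage is conditional: it only counts Gems for which a spec
file could be generated in the first place.</p>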
<div class="text-center img-row row">
  <a href="https://frostyx.cz/files/img/rubygems-success-rate.png" title="The exact number of failures caused by missing license vs other SRPM failures will be updated">
    <img src="https://frostyx.cz/files/img/rubygems-success-rate.png" />
  </a>
</div>
<p>The vast majority of failures were caused by a missing license
field in the Gem metadata. There is likely nothing wrong with
these Gems, and technically, they could be built without any issues, but we
simply don’t have the legal right to do so. Therefore, such builds were
aborted before even downloading the sources. This affected 62 049
packages.</p>
<h2 id="more-stats">More stats</h2>

<p>All Gems were rebuilt within the <a href="https://copr.fedorainfracloud.org/coprs/g/rubygems/rubygems/">@rubygems/rubygems</a>
Copr project for <code class="language-plaintext highlighter-rouge">fedora-rawhide-x86_64</code> and <code class="language-plaintext highlighter-rouge">fedora-rawhide-i386</code>.</p>
<p>We submitted all builds <em>at once</em>, starting on Sep 11, 2021, and the
whole rebuild was finished on Oct 17, 2021. It took Copr a little over
a month, and within that time, the number of pending builds peaked at
129 515.</p>
<div class="text-center img-row row">
  <a href="https://frostyx.cz/files/img/rubygems-builds-graph.png" title="The pending builds peaked at 129515">
    <img src="https://frostyx.cz/files/img/rubygems-builds-graph.png" />
  </a>
</div>
<p>The figure of 24 468 running builds doesn’t mean that many
builds ran at once. It is the number of builds that entered the
<code class="language-plaintext highlighter-rouge">running</code> state on that day. It isn’t an accurate measure of Copr
throughput either, as we worked on eliminating performance issues
along the way. A similar mass rebuild should take a fraction of the
time now.</p>
<p>The resulting RPM packages take up 55GB per chroot, so 110GB in
total. Another 640MB of SRPM packages were created as a
byproduct.</p>
<p>The repository metadata is 130MB in size, and it takes DNF around 5 minutes on
my laptop (Lenovo X1 Carbon) to enable the repository and install a package
from it for the first time (because it needs to create a cache).
Subsequent installations from the repository are instant.</p>
<h2 id="in-perspective">In perspective</h2>

<p>To judge whether those numbers are significant or interesting, I
think we need to compare them with other repositories.</p>
<table class="table table-bordered table-hover">
  <thead>
    <tr>
      <th>Metric</th>
      <th>@rubygems/rubygems</th>
      <th>Fedora Rawhide (F36)</th>
      <th>EPEL8</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Number of packages</strong></td>
      <td>98 816</td>
      <td>34 062</td>
      <td>4 806</td>
    </tr>
    <tr>
      <td><strong>Size per chroot</strong></td>
      <td>55GB</td>
      <td>83GB</td>
      <td>6.7GB</td>
    </tr>
    <tr>
      <td><strong>Metadata size</strong></td>
      <td>130MB</td>
      <td>61MB</td>
      <td>11MB</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dnf makecache</code></td>
      <td>~5 minutes</td>
      <td>~22 seconds</td>
      <td>1 second</td>
    </tr>
  </tbody>
</table>
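<p>One way to read the table is to normalize it per package. The sketch below
does that with the table's own rough figures. It assumes that "~5 minutes"
means 300 seconds and that 1MB is 1000KB; treat the results as
back-of-the-envelope numbers, not measurements.</p>

```python
# Rows copied from the table: (packages, metadata size in MB, makecache seconds).
# "~5 minutes" is taken as 300 s; all timings are the article's rough figures.
REPOS = {
    "@rubygems/rubygems": (98_816, 130, 300),
    "Fedora Rawhide (F36)": (34_062, 61, 22),
    "EPEL8": (4_806, 11, 1),
}


def per_package(packages, metadata_mb, makecache_s):
    """Return (metadata KB per package, makecache ms per package)."""
    kb_per_pkg = metadata_mb * 1000 / packages  # assumes 1 MB = 1000 KB
    ms_per_pkg = makecache_s * 1000 / packages
    return round(kb_per_pkg, 2), round(ms_per_pkg, 2)


for name, row in REPOS.items():
    kb, ms = per_package(*row)
    print(f"{name}: {kb} KB metadata/package, {ms} ms makecache/package")
```

<p>With these inputs, the metadata per package is smallest for the RubyGems
repository, while the <code class="language-plaintext highlighter-rouge">dnf makecache</code> time per package is the largest,
which suggests that cache creation grows faster than the package count.</p>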
<h2 id="motivation">Motivation</h2>

<p>What was the point of this <em>experiment</em> anyway?</p>

<p>The goal was to rebuild all packages from a third-party,
language-specific package hosting service. There was no
particular reason why we chose <a href="https://rubygems.org/">RubyGems.org</a> among other
options.</p>
<p>We hoped to pioneer this area, figure out the pain points, and make it
easier for others to mass-rebuild something that might be helpful to them.
While doing so, we had the opportunity to improve the Copr service and
test the performance of the whole RPM toolchain against large repositories.</p>
<p>There are reasons to avoid installing packages directly via <code class="language-plaintext highlighter-rouge">gem</code>,
<code class="language-plaintext highlighter-rouge">pip</code>, etc., but that’s a whole other discussion. Let me just
reference a <a href="https://stackoverflow.com/a/33584893/3285282">brief answer from Stack Overflow</a>.</p>
<h2 id="internals">Internals</h2>

<p>Surprisingly enough, the mass rebuild itself wasn’t that
challenging. The real work came from its consequences
(unfair queue, slow <code class="language-plaintext highlighter-rouge">createrepo_c</code>, timeouts everywhere). Rebuilding
the whole <a href="https://rubygems.org/">RubyGems.org</a> was as easy as:</p>
<ol>
  <li>
    <p>Figuring out a way to convert a <a href="https://bundler.io/gemfile.html">Gemfile</a> into a
    <a href="https://rpm-packaging-guide.github.io/#what-is-a-spec-file">spec</a>. Thank you, <a href="https://github.com/fedora-ruby/gem2rpm">gem2rpm</a>!</p>
  </li>
  <li>
    <p>Figuring out how to submit a single Gem into Copr. In this case, we
    have built-in support for <a href="https://github.com/fedora-ruby/gem2rpm">gem2rpm</a> (see
    <a href="https://docs.pagure.org/copr.copr/user_documentation.html#rubygems">the documentation</a>), so it was as easy as
    <code class="language-plaintext highlighter-rouge">copr-cli buildgem ...</code>. Similarly, we have built-in support for
    PyPI. For anything else, you would have to use the
    <a href="https://docs.pagure.org/copr.copr/custom_source_method.html#custom-source-method">Custom source method</a> (at least until support
    for such a tool or service is built into Copr directly).</p>
  </li>
  <li>
    <p>Iterating over the whole <a href="https://rubygems.org/">RubyGems.org</a> repository and
    submitting Gems one by one. A simple script is more than
    sufficient, but we used <a href="https://github.com/fedora-copr/copr-rebuild-tools">copr-rebuild-tools</a>,
    which I wrote many years ago.</p>
  </li>
  <li>
    <p>Setting up automatic rebuilds of new Gems.
    <a href="https://release-monitoring.org/">release-monitoring.org</a> (aka Anitya) is
    perfect for that. We <a href="https://pagure.io/copr/copr/blob/main/f/frontend/coprs_frontend/run/check_for_anitya_version_updates.py">check</a> for new
    <a href="https://rubygems.org/">RubyGems.org</a> updates every hour, and it would be
    trivial to add support for any other <a href="https://release-monitoring.org/static/docs/user-guide.html#backends">backend</a>.
    Thanks to Anitya, the repository will always provide the most
    up-to-date packages.</p>
  </li>
</ol>
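<p>The "simple script" from step 3 could look roughly like the sketch below. It
is not how <a href="https://github.com/fedora-copr/copr-rebuild-tools">copr-rebuild-tools</a> works; it only wraps the
<code class="language-plaintext highlighter-rouge">copr-cli buildgem</code> command mentioned above. The <code class="language-plaintext highlighter-rouge">--gem</code> and
<code class="language-plaintext highlighter-rouge">--nowait</code> options are written as I understand them, so please verify them with
<code class="language-plaintext highlighter-rouge">copr-cli buildgem --help</code>. The function names and the sample Gem list are
made up for illustration.</p>

```python
import subprocess
from typing import Iterable, List

PROJECT = "@rubygems/rubygems"


def buildgem_command(gem: str, project: str = PROJECT) -> List[str]:
    """Build the argv for submitting one Gem to Copr without waiting for it.

    Assumption: copr-cli accepts --gem and --nowait for the buildgem command.
    """
    return ["copr-cli", "buildgem", "--gem", gem, "--nowait", project]


def submit_all(gems: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    """Submit Gems one by one; with dry_run=True only return the commands."""
    commands = [buildgem_command(gem) for gem in gems]
    if not dry_run:
        for command in commands:
            # Needs an installed and configured copr-cli.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of submitting anything.
    for command in submit_all(["rake", "rails"]):
        print(" ".join(command))
```

<p>The dry-run default is deliberate. Submitting over a hundred thousand builds
by accident is not something you want to debug afterwards.</p>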
<h2 id="takeaway-for-rubygems">Takeaway for RubyGems</h2>

<p>If you maintain any Gems, please make sure that you have properly set
their license. If you develop or maintain any piece of software, for
that matter, please make sure it is publicly known under which license
it is available.</p>
<p>Contrary to common belief, unlicensed software, even though
publicly available on <a href="https://github.com/">GitHub</a> or <a href="https://rubygems.org/">RubyGems</a>, is in
fact protected by copyright, and therefore cannot be legally used
(because a license is needed to grant usage rights). As such,
unlicensed software is neither <a href="https://www.gnu.org/philosophy/free-sw.en.html">Free software</a> nor
<a href="https://opensource.com/resources/what-open-source">open source</a>, even though technically it can be
downloaded and installed by anyone.</p>
<p>If I could send one wish to the <a href="https://rubygems.org/">RubyGems.org</a>
maintainers, it would be this: please consider placing more emphasis on
licensing and making it <a href="https://guides.rubygems.org/specification-reference/">required instead of
recommended</a>.</p>
<p>For reference, here is a
<a href="https://gist.github.com/FrostyX/e324c667c97ff80d7f145f5c2c936f27#file-rubygems-unlicensed-list">list of all 65 206 unlicensed Gems</a>,
generated by the following script on Nov 14, 2021.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># List every Gem on RubyGems.org whose metadata has no license set.</span>
<span class="k">for </span>gem <span class="k">in</span> <span class="si">$(</span>gem search <span class="nt">--remote</span> | cut <span class="nt">-d</span> <span class="s2">" "</span> <span class="nt">-f1</span><span class="si">)</span><span class="p">;</span> <span class="k">do</span>
    <span class="nv">url</span><span class="o">=</span><span class="s2">"https://rubygems.org/api/v1/gems/</span><span class="nv">$gem</span><span class="s2">.json"</span>
    <span class="nv">metadata</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> <span class="s2">"</span><span class="nv">$url</span><span class="s2">"</span><span class="si">)</span>
    <span class="c"># jq -e exits non-zero unless .licenses is a non-empty array</span>
    <span class="k">if</span> <span class="o">!</span> <span class="nb">echo</span> <span class="s2">"</span><span class="nv">$metadata</span><span class="s2">"</span> | jq <span class="nt">-e</span> <span class="s1">'.licenses | select(type == "array" and length &gt; 0)'</span> <span class="se">\</span>
            <span class="o">&gt;</span>/dev/null<span class="p">;</span> <span class="k">then</span>
        <span class="nb">echo</span> <span class="s2">"</span><span class="nv">$metadata</span><span class="s2">"</span> | jq <span class="nt">-r</span> <span class="s1">'.name'</span>
    <span class="k">fi</span>
<span class="k">done</span>
</code></pre></div></div>
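<p>For readers less familiar with <code class="language-plaintext highlighter-rouge">jq</code>, the condition in the script can be
restated as a small predicate. This is only an illustration: the
<code class="language-plaintext highlighter-rouge">has_license</code> name is mine, and the JSON samples are hand-written to
resemble the API response, not copied from it.</p>

```python
import json
from typing import Any, Dict


def has_license(metadata: Dict[str, Any]) -> bool:
    """Mirror the jq filter: true only if "licenses" is a non-empty list."""
    licenses = metadata.get("licenses")
    return isinstance(licenses, list) and len(licenses) > 0


# Hand-written samples shaped like the API response, not real API output.
samples = [
    '{"name": "licensed-gem", "licenses": ["MIT"]}',
    '{"name": "null-license-gem", "licenses": null}',
    '{"name": "empty-license-gem", "licenses": []}',
]

unlicensed = [
    gem["name"] for gem in map(json.loads, samples) if not has_license(gem)
]
print(unlicensed)
```

<p>A missing field, a <code class="language-plaintext highlighter-rouge">null</code> value, and an empty list are all treated as
"no license", which matches what the shell script reports.</p>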
<p>There are also
<a href="https://gist.github.com/FrostyX/e324c667c97ff80d7f145f5c2c936f27#file-rubygems-license-only-in-sources-list">3 157 packages</a>
that don’t have their license field set
on <a href="https://rubygems.org/">RubyGems.org</a>, but we were able to parse their license
from the sources.</p>
<h2 id="takeaway-for-dnf">Takeaway for DNF</h2>

<p>It turns out DNF handles large repositories without any major
difficulties. The only inconvenience is how long it takes to create
its cache. To reproduce, enable the repository.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dnf copr enable @rubygems/rubygems
</code></pre></div></div>

<p>And create the cache from scratch. It will take a while (5 minutes for
the single repo on my machine).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dnf makecache
</code></pre></div></div>
<p>I am not that familiar with DNF internals, so I don’t really know if
this is something that can be fixed. But it would certainly be
worth exploring whether any performance improvements can be made.</p>
<h2 id="takeaway-for-createrepo_c">Takeaway for createrepo_c</h2>

<p>We cooperated with <code class="language-plaintext highlighter-rouge">createrepo_c</code> developers on multiple performance
improvements in the past, and these days <code class="language-plaintext highlighter-rouge">createrepo_c</code> works
perfectly for large repositories. There is nothing crucial left to do,
so I would like to briefly describe how to use the <code class="language-plaintext highlighter-rouge">createrepo_c</code>
optimization features instead.</p>
<p>The first <code class="language-plaintext highlighter-rouge">createrepo_c</code> run for a large repo will always be slow, and
there is no way around it. Use the <code class="language-plaintext highlighter-rouge">--workers</code> parameter to specify how many threads
should be spawned for reading RPMs. While this brings a significant
speedup (it cuts the time in half), the problem is that even listing
a large directory is too expensive, so the run will still take tens of minutes.</p>
<p>Specify the <code class="language-plaintext highlighter-rouge">--pkglist</code> parameter to let <code class="language-plaintext highlighter-rouge">createrepo_c</code> generate a new
file containing the list of all packages in the repository. It will
help us speed up the subsequent <code class="language-plaintext highlighter-rouge">createrepo_c</code> runs. For them,
also specify <code class="language-plaintext highlighter-rouge">--update</code>, <code class="language-plaintext highlighter-rouge">--recycle-pkglist</code>, and <code class="language-plaintext highlighter-rouge">--skip-stat</code>. The
repository regeneration will then take only a couple of seconds
(<a href="https://github.com/rpm-software-management/createrepo_c/commit/437451f3bea5430c0a6f678b2a65ebbbbcb12de0">437451f</a>).</p>
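<p>Put together, the two kinds of runs described above might be invoked roughly as
follows. This is a sketch, not the exact command line Copr uses: the helper
names, the paths, and the worker count are placeholders, and how the package
list file is produced and maintained is left out.</p>

```python
import shlex
from typing import List


def first_run(repo_dir: str, pkglist: str, workers: int) -> List[str]:
    """Argv for the initial (slow) metadata generation."""
    return ["createrepo_c", "--workers", str(workers),
            "--pkglist", pkglist, repo_dir]


def update_run(repo_dir: str, pkglist: str, workers: int) -> List[str]:
    """Argv for later runs that reuse the existing metadata."""
    return ["createrepo_c", "--update", "--recycle-pkglist", "--skip-stat",
            "--workers", str(workers), "--pkglist", pkglist, repo_dir]


if __name__ == "__main__":
    # Placeholder paths, only printed and never executed here.
    repo, pkglist = "/srv/repo/fedora-rawhide-x86_64", "/srv/repo/pkglist.txt"
    print(shlex.join(first_run(repo, pkglist, workers=8)))
    print(shlex.join(update_run(repo, pkglist, workers=8)))
```

<p>Only the option names come from the text above; check
<code class="language-plaintext highlighter-rouge">createrepo_c --help</code> for their exact semantics before relying on them.</p>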
<h2 id="takeaway-for-appstream-builder">Takeaway for appstream-builder</h2>

<p>On the other hand, <code class="language-plaintext highlighter-rouge">appstream-builder</code> takes more than 20 minutes to
finish, and we didn’t find any way to make it run faster. As a
(hopefully) temporary solution, we added an option to disable
<a href="https://www.freedesktop.org/software/appstream/docs/">AppStream</a> metadata generation for a given project
(<a href="https://pagure.io/copr/copr/pull-request/742">PR#742</a>), and we recommend that owners of large projects do so.</p>

<p>In the long term, it may be worth checking whether
there are ways to improve the <code class="language-plaintext highlighter-rouge">appstream-builder</code>
performance. If you are interested, see upstream issue
<a href="https://github.com/hughsie/appstream-glib/issues/301">#301</a>.</p>
<h2 id="takeaway-for-copr">Takeaway for Copr</h2>

<p>The month of September turned into one big stress test, causing Copr
to be temporarily incapacitated but helping us provide a better
service in the long term. Because we had never had such a big project
before, we experienced and fixed several issues in the UX and data
handling on the frontend and backend. Here are some of them:</p>
<ul>
  <li>Due to periodically logging all pending builds, the Apache log
    skyrocketed to 20GB and consumed all available disk space
    (<a href="https://pagure.io/copr/copr/pull-request/1916">PR#1916</a>).</li>
  <li>Updating project settings timed out (<a href="https://pagure.io/copr/copr/pull-request/1968">PR#1968</a>).</li>
  <li>Unfair repository locking caused some builds to take unjustifiably long
    to finish (<a href="https://pagure.io/copr/copr/pull-request/1927">PR#1927</a>).</li>
  <li>We used to delegate pagination to the client to provide a
    better user experience (and honestly, to avoid implementing it
    ourselves). This made listing builds and packages in a large project
    either take a long time or time out. We switched to backend
    pagination for projects with more than 10 000 builds/packages
    (<a href="https://pagure.io/copr/copr/pull-request/1908">PR#1908</a>).</li>
  <li>People used to scrape the monitor page of their projects, but that
    isn’t an option anymore due to the more conservative pagination
    implementation. Therefore, we added proper support for the project
    monitor to the API and <code class="language-plaintext highlighter-rouge">copr-cli</code> (<a href="https://pagure.io/copr/copr/pull-request/1953">PR#1953</a>).</li>
  <li>The API call for obtaining all project builds was too slow for large
    projects. In the case of the <code class="language-plaintext highlighter-rouge">@rubygems/rubygems</code> project, we
    managed to reduce the required time from around 42 minutes to 13
    minutes (<a href="https://pagure.io/copr/copr/pull-request/1930">PR#1930</a>).</li>
  <li>The <code class="language-plaintext highlighter-rouge">copr-cli</code> command for listing all project packages was too slow
    and didn’t continuously print the output. In the case of the
    <code class="language-plaintext highlighter-rouge">@rubygems/rubygems</code> project, we reduced its time from around 40
    minutes to 35 seconds (<a href="https://pagure.io/copr/copr/pull-request/1914">PR#1914</a>).</li>
</ul>
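<p>To illustrate the pagination change from the list above, here is a minimal
sketch of a threshold-based switch. It is not Copr's actual code: the names are
hypothetical, the page size is arbitrary, and a real backend would page in the
database query rather than slice an in-memory list.</p>

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# The text says backend pagination applies above 10 000 builds/packages.
BACKEND_PAGINATION_THRESHOLD = 10_000


def use_backend_pagination(total_items: int) -> bool:
    """Small projects keep client-side pagination; large ones page on the server."""
    return total_items > BACKEND_PAGINATION_THRESHOLD


def page_of(items: Sequence[T], page: int, per_page: int = 50) -> List[T]:
    """Return one 1-based page of items (stand-in for a LIMIT/OFFSET query)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return list(items[start:start + per_page])


if __name__ == "__main__":
    builds = list(range(120))  # pretend build IDs
    print(use_backend_pagination(len(builds)))
    print(page_of(builds, page=3))
```

<p>The point of the switch is that the server never has to send, and the browser
never has to render, the full list for a project of this size.</p>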
<h2 id="lets-build-more">Let’s build more</h2>

<p>Such a mass rebuild required no special permissions, proprietary
tools, or any other prerequisites. Any user could have done
it. In fact, some of them already did.</p>
<ul>
  <li><a href="https://copr.fedorainfracloud.org/coprs/iucar/cran/">iucar/cran</a></li>
  <li><a href="https://copr.fedorainfracloud.org/coprs/g/python/python3.10/">@python/python3.10</a></li>
  <li><a href="https://pypi.org/">PyPI</a> rebuild is being worked on by
    <a href="https://github.com/befeleme">Karolina Surma</a></li>
</ul>
<p>But don’t be fooled, Copr can handle more. Will somebody try
<a href="https://www.npmjs.com/">npm</a>, <a href="https://packagist.org/">Packagist</a>, <a href="https://hackage.haskell.org/">Hackage</a>, <a href="https://www.cpan.org/">CPAN</a>,
<a href="https://elpa.gnu.org/">ELPA</a>, etc.? Let us know.</p>
<p>I would suggest starting with <a href="https://docs.pagure.org/copr.copr/user_documentation.html#mass-rebuilds">Copr Mass Rebuilds
documentation</a>.</p>