<p>We took all 166 699 packages from <a href="https://rubygems.org/">RubyGems.org</a> and
rebuilt them in <a href="https://copr.fedorainfracloud.org/">Copr</a>. Let's explore the results together.</p>
<h2 id="success-rate">Success rate</h2>
<p>From the 166 699 Gems hosted on <a href="https://rubygems.org/">RubyGems.org</a>,
98 816 were successfully built in <a href="https://copr.fedorainfracloud.org/">Copr</a> for Fedora
Rawhide. That makes for a 59.3% success rate. For the rest, it
is important to distinguish in which build phase they failed. Out of
67 883 failures, 62 717 happened while converting
their <a href="https://bundler.io/gemfile.html">Gemfile</a> into a <a href="https://rpm-packaging-guide.github.io/#what-is-a-spec-file">spec</a> file, and only 5 166 while
building the actual RPM packages. This means that if a Gem can be
properly converted to a <a href="https://rpm-packaging-guide.github.io/#what-is-a-spec-file">spec</a> file, there is a 95% probability
that it will be successfully built into an RPM.</p>
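<p>Both rates follow directly from the raw counts. A quick sketch with <code>awk</code> (the numbers are copied from the paragraph above) reproduces them:</p>

```shell
#!/bin/bash
# Numbers from the paragraph above.
total=166699      # all Gems on RubyGems.org
built=98816       # successfully built RPMs
rpm_fail=5166     # failures during the RPM build phase itself

# Overall success rate: built / total
awk -v b="$built" -v t="$total" 'BEGIN { printf "%.1f%%\n", 100 * b / t }'

# Success rate once a spec file exists: built / (built + RPM-phase failures)
awk -v b="$built" -v f="$rpm_fail" 'BEGIN { printf "%.0f%%\n", 100 * b / (b + f) }'
```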
<div class="text-center img-row row">
<a href="https://frostyx.cz/files/img/rubygems-success-rate.png" title="The exact number of failures caused by missing license vs other SRPM failures will be updated">
<img src="https://frostyx.cz/files/img/rubygems-success-rate.png" />
</a>
</div>
<p>By far, the majority of failures were caused by a missing license
field for the particular Gems. There is likely nothing wrong with
them, and technically, they could be built without any issues, but we
simply don't have the legal right to do so. Therefore, such builds were
aborted before even downloading the sources. This affected 62 049
packages.</p>
<h2 id="more-stats">More stats</h2>
<p>All Gems were rebuilt within the <a href="https://copr.fedorainfracloud.org/coprs/g/rubygems/rubygems/">@rubygems/rubygems</a>
Copr project for <code class="language-plaintext highlighter-rouge">fedora-rawhide-x86_64</code> and <code class="language-plaintext highlighter-rouge">fedora-rawhide-i386</code>.</p>
<p>We submitted all builds <em>at once</em>, starting on Sep 11, 2021, and the
whole rebuild was finished on Oct 17, 2021. It took Copr a little over
a month, and within that time, the number of pending builds peaked at
129 515.</p>
<div class="text-center img-row row">
<a href="https://frostyx.cz/files/img/rubygems-builds-graph.png" title="The pending builds peaked at 129515">
<img src="https://frostyx.cz/files/img/rubygems-builds-graph.png" />
</a>
</div>
<p>The number of running builds doesn't represent 24 468 builds running
at once but rather the number of builds that entered the
<code class="language-plaintext highlighter-rouge">running</code> state on that day. It doesn't represent Copr throughput
accurately though, as we worked on eliminating performance issues
along the way. A similar mass rebuild should take a fraction of the
time now.</p>
<p>The resulting RPM packages take up 55GB per chroot, therefore 110GB in
total. Another 640MB of SRPM packages were created as a
byproduct.</p>
<p>The repository metadata weighs 130MB, and it takes DNF around 5 minutes on
my laptop (Lenovo X1 Carbon) to enable the repository and install a package
from it for the first time (because it needs to create a cache).
Subsequent installations from the repository are instant.</p>
<h2 id="in-perspective">In perspective</h2>
<p>To see whether those numbers are significant or interesting at all, I
think we need to compare them with other repositories.</p>
<table class="table table-bordered table-hover">
<thead>
<tr>
<th>#</th>
<th>@rubygems/rubygems</th>
<th>Fedora Rawhide (F36)</th>
<th>EPEL8</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Number of packages</strong></td>
<td>98 816</td>
<td>34 062</td>
<td>4 806</td>
</tr>
<tr>
<td><strong>Size per chroot</strong></td>
<td>55GB</td>
<td>83GB</td>
<td>6.7GB</td>
</tr>
<tr>
<td><strong>Metadata size</strong></td>
<td>130MB</td>
<td>61MB</td>
<td>11MB</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">dnf makecache</code></td>
<td>~5 minutes</td>
<td>~22 seconds</td>
<td>1 second</td>
</tr>
</tbody>
</table>
<h2 id="motivation">Motivation</h2>
<p>What was the point of this <em>experiment</em> anyway?</p>
<p>The goal was to rebuild all packages from a third-party hosting
service that is specific to some programming language. There was no
particular reason why we chose <a href="https://rubygems.org/">RubyGems.org</a> among other
options.</p>
<p>We hoped to pioneer this area, figure out the pain points, and make it
easier for others to mass-rebuild something that might be helpful to them.
While doing so, we had the opportunity to improve the Copr service and
test the performance of the whole RPM toolchain against large repositories.</p>
<p>There are reasons to avoid installing packages directly via <code class="language-plaintext highlighter-rouge">gem</code>,
<code class="language-plaintext highlighter-rouge">pip</code>, etc., but that's a whole other discussion. Let me just
reference a <a href="https://stackoverflow.com/a/33584893/3285282">brief answer from Stack Overflow</a>.</p>
<h2 id="internals">Internals</h2>
<p>Surprisingly enough, the mass rebuild itself wasn't that
challenging. The real work lay in dealing with its consequences
(unfair queue, slow <code class="language-plaintext highlighter-rouge">createrepo_c</code>, timeouts everywhere). Rebuilding
the whole of <a href="https://rubygems.org/">RubyGems.org</a> was as easy as:</p>
<ol>
<li>
<p>Figuring out a way to convert a <a href="https://bundler.io/gemfile.html">Gemfile</a> into a
<a href="https://rpm-packaging-guide.github.io/#what-is-a-spec-file">spec</a> file. Thank you, <a href="https://github.com/fedora-ruby/gem2rpm">gem2rpm</a>!</p>
</li>
<li>
<p>Figuring out how to submit a single Gem into Copr. In this case, we
have built-in support for <a href="https://github.com/fedora-ruby/gem2rpm">gem2rpm</a> (see
<a href="https://docs.pagure.org/copr.copr/user_documentation.html#rubygems">the documentation</a>), therefore it was as easy as
<code class="language-plaintext highlighter-rouge">copr-cli buildgem ...</code>. Similarly, we have built-in support for
PyPI. For anything else, you would have to utilize the
<a href="https://docs.pagure.org/copr.copr/custom_source_method.html#custom-source-method">Custom source method</a> (at least until support
for such a tool/service is built into Copr directly).</p>
</li>
<li>
<p>Iterating over the whole <a href="https://rubygems.org/">RubyGems.org</a> repository and
submitting gems one by one. A simple script is more than
sufficient, but we utilized <a href="https://github.com/fedora-copr/copr-rebuild-tools">copr-rebuild-tools</a>
that I wrote many years ago.</p>
</li>
<li>
<p>Setting up automatic rebuilds of new Gems.
<a href="https://release-monitoring.org/">release-monitoring.org</a> (aka Anitya) is
perfect for that. We <a href="https://pagure.io/copr/copr/blob/main/f/frontend/coprs_frontend/run/check_for_anitya_version_updates.py">check</a> for new
<a href="https://rubygems.org/">RubyGems.org</a> updates every hour, and it would be
trivial to add support for any other <a href="https://release-monitoring.org/static/docs/user-guide.html#backends">backend</a>.
Thanks to Anitya, the repository will always provide the most
up-to-date packages.</p>
</li>
</ol>
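<p>Steps 2 and 3 above boil down to a loop like the following. This is a dry-run sketch: it only prints the submission commands, and the exact <code class="language-plaintext highlighter-rouge">copr-cli buildgem</code> flags should be verified against the Copr documentation linked above before running it for real.</p>

```shell
#!/bin/bash
# Dry-run sketch: print the Copr submission command for each gem.
# Replace `echo` with the real invocation to actually submit the builds.
submit_gems() {
    for gem in "$@"; do
        echo copr-cli buildgem --gem "$gem" @rubygems/rubygems
    done
}

# In the real rebuild, the list came from iterating over the whole
# RubyGems.org index; three well-known gems stand in for it here.
submit_gems rails rake nokogiri
```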
<h2 id="takeaway-for-rubygems">Takeaway for RubyGems</h2>
<p>If you maintain any Gems, please make sure that you have properly set
their license. If you develop or maintain any piece of software, for
that matter, please make sure it is publicly known under which license
it is available.</p>
<p>Contrary to common belief, unlicensed software, even though
publicly available on <a href="https://github.com/">GitHub</a> or <a href="https://rubygems.org/">RubyGems</a>, is in
fact protected by copyright and therefore cannot be legally used
(because a license is needed to grant usage rights). As such,
unlicensed software is neither <a href="https://www.gnu.org/philosophy/free-sw.en.html">Free software</a> nor
<a href="https://opensource.com/resources/what-open-source">open source</a>, even though technically it can be
downloaded and installed by anyone.</p>
<p>If I could send one wishful message to the <a href="https://rubygems.org/">RubyGems.org</a>
maintainers, it would be this: please consider placing a higher significance on
licensing and make it <a href="https://guides.rubygems.org/specification-reference/">required instead of
recommended</a>.</p>
<p>For reference, here is a <a href="https://gist.github.com/FrostyX/e324c667c97ff80d7f145f5c2c936f27#file-rubygems-unlicensed-list">list of all 65 206 unlicensed Gems</a>
generated by the following script (on Nov 14, 2021).</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="k">for </span>gem <span class="k">in</span> <span class="si">$(</span>gem search <span class="nt">--remote</span> |cut <span class="nt">-d</span> <span class="s2">" "</span> <span class="nt">-f1</span><span class="si">)</span> <span class="p">;</span> <span class="k">do
</span><span class="nv">url</span><span class="o">=</span><span class="s2">"https://rubygems.org/api/v1/gems/</span><span class="nv">$gem</span><span class="s2">.json"</span>
<span class="nv">metadata</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-s</span> <span class="nv">$url</span><span class="si">)</span>
<span class="k">if</span> <span class="o">!</span> <span class="nb">echo</span> <span class="nv">$metadata</span> |jq <span class="nt">-e</span> <span class="s1">'.licenses |select(type == "array" and length &gt; 0)'</span><span class="se">\</span>
<span class="o">&gt;</span>/dev/null<span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="nv">$metadata</span> |jq <span class="nt">-r</span> <span class="s1">'.name'</span>
<span class="k">fi
done</span>
</code></pre></div></div>
<p>There are also 3 157 packages that don't have their license field set
on <a href="https://rubygems.org/">RubyGems.org</a>, but we were able to parse their license
from <a href="https://gist.github.com/FrostyX/e324c667c97ff80d7f145f5c2c936f27#file-rubygems-license-only-in-sources-list">the sources</a>.</p>
<h2 id="takeaway-for-dnf">Takeaway for DNF</h2>
<p>It turns out DNF handles large repositories without any major
difficulties. The only inconvenience is how long it takes to create
its cache. To reproduce, first enable the repository:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dnf copr enable @rubygems/rubygems
</code></pre></div></div>
<p>Then create the cache from scratch. It will take a while (5 minutes for
this single repo on my machine):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dnf makecache
</code></pre></div></div>
<p>I am not that familiar with DNF internals, so I don't really know if
this is something that can be fixed. But it would certainly be
worth exploring whether any performance improvements can be made.</p>
<h2 id="takeaway-for-createrepo_c">Takeaway for createrepo_c</h2>
<p>We cooperated with <code class="language-plaintext highlighter-rouge">createrepo_c</code> developers on multiple performance
improvements in the past, and these days <code class="language-plaintext highlighter-rouge">createrepo_c</code> works
perfectly for large repositories. There is nothing crucial left to do,
so I would like to briefly describe how to utilize <code class="language-plaintext highlighter-rouge">createrepo_c</code>
optimization features instead.</p>
<p>The first <code class="language-plaintext highlighter-rouge">createrepo_c</code> run for a large repo will always be slow, so just
get over it. Use the <code class="language-plaintext highlighter-rouge">--workers</code> parameter to specify how many threads
should be spawned for reading RPMs. While this brings a significant
speedup (and cuts the time in half), the problem is that even listing
a large directory is too expensive. It will take tens of minutes.</p>
<p>Specify the <code class="language-plaintext highlighter-rouge">--pkglist</code> parameter to let <code class="language-plaintext highlighter-rouge">createrepo_c</code> generate a new
file containing the list of all packages in the repository. It will
help speed up subsequent <code class="language-plaintext highlighter-rouge">createrepo_c</code> runs. For those,
also specify <code class="language-plaintext highlighter-rouge">--update</code>, <code class="language-plaintext highlighter-rouge">--recycle-pkglist</code>, and <code class="language-plaintext highlighter-rouge">--skip-stat</code>. The
repository regeneration will then take only a couple of seconds
(<a href="https://github.com/rpm-software-management/createrepo_c/commit/437451f3bea5430c0a6f678b2a65ebbbbcb12de0">437451f</a>).</p>
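<p>Put together, the two-phase workflow looks like this sketch. The path and worker count are illustrative, and the commands are echoed rather than executed so the sketch runs anywhere; drop the <code class="language-plaintext highlighter-rouge">echo</code> to use it for real.</p>

```shell
#!/bin/bash
repo=/srv/repo/rubygems                          # illustrative path
common="--workers 8 --pkglist rubygems.pkglist"  # shared by both phases

# First run: slow -- every RPM is read, and the package list is recorded.
echo createrepo_c $common "$repo"

# Subsequent runs: reuse the recorded list, skip the stat() calls, and
# update the existing metadata in place -- seconds instead of minutes.
echo createrepo_c --update --recycle-pkglist --skip-stat $common "$repo"
```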
<h2 id="takeaway-for-appstream-builder">Takeaway for appstream-builder</h2>
<p>On the other hand, <code class="language-plaintext highlighter-rouge">appstream-builder</code> takes more than 20 minutes to
finish, and we didn't find any way to make it run faster. As a
(hopefully) temporary solution, we added the possibility to disable
<a href="https://www.freedesktop.org/software/appstream/docs/">AppStream</a> metadata generation for a given project
(<a href="https://pagure.io/copr/copr/pull-request/742">PR#742</a>), and we recommend that owners of large projects do so.</p>
<p>From a long-term perspective, it may be worth checking whether
the <code class="language-plaintext highlighter-rouge">appstream-builder</code>
performance can be improved. If you are interested, see upstream issue
<a href="https://github.com/hughsie/appstream-glib/issues/301">#301</a>.</p>
<h2 id="takeaway-for-copr">Takeaway for Copr</h2>
<p>The month of September turned into one big stress test, causing Copr
to be temporarily incapacitated but helping us provide a better
service in the long term. Because we had never had such a big project in
the past, we experienced and fixed several issues in the UX and data
handling on the frontend and backend. Here are some of them:</p>
<ul>
<li>Due to periodically logging all pending builds, the Apache log
skyrocketed to 20GB and consumed all available disk space
(<a href="https://pagure.io/copr/copr/pull-request/1916">PR#1916</a>).</li>
<li>Timeouts when updating project settings (<a href="https://pagure.io/copr/copr/pull-request/1968">PR#1968</a>).</li>
<li>Unfair repository locking caused some builds to take unjustifiably
long to finish (<a href="https://pagure.io/copr/copr/pull-request/1927">PR#1927</a>).</li>
<li>We used to delegate pagination to the client to provide a
better user experience (and honestly, to avoid implementing it
ourselves). This made listing builds and packages in a large project
either take a long time or timeout. We switched to backend
pagination for projects with more than 10 000 builds/packages
(<a href="https://pagure.io/copr/copr/pull-request/1908">PR#1908</a>).</li>
<li>People used to scrape the monitor page of their projects, but that
isn't an option anymore due to the more conservative pagination
implementation. Therefore, we added proper support for the project
monitor into the API and <code class="language-plaintext highlighter-rouge">copr-cli</code> (<a href="https://pagure.io/copr/copr/pull-request/1953">PR#1953</a>).</li>
<li>The API call for obtaining all project builds was too slow for large
projects. In the case of the <code class="language-plaintext highlighter-rouge">@rubygems/rubygems</code> project, we
managed to reduce the required time from around 42 minutes to 13
minutes (<a href="https://pagure.io/copr/copr/pull-request/1930">PR#1930</a>).</li>
<li>The <code class="language-plaintext highlighter-rouge">copr-cli</code> command for listing all project packages was too slow
and didn't continuously print the output. In the case of the
<code class="language-plaintext highlighter-rouge">@rubygems/rubygems</code> project, we reduced its time from around 40
minutes to 35 seconds (<a href="https://pagure.io/copr/copr/pull-request/1914">PR#1914</a>).</li>
</ul>
<h2 id="lets-build-more">Let's build more</h2>
<p>No special permissions, proprietary tools, or other prerequisites
were necessary to achieve such a mass rebuild. Any user could have done
it. In fact, some of them already did.</p>
<ul>
<li><a href="https://copr.fedorainfracloud.org/coprs/iucar/cran/">iucar/cran</a></li>
<li><a href="https://copr.fedorainfracloud.org/coprs/g/python/python3.10/">@python/python3.10</a></li>
<li><a href="https://pypi.org/">PyPI</a> rebuild is being worked on by
<a href="https://github.com/befeleme">Karolina Surma</a></li>
</ul>
<p>But don't be fooled, Copr can handle more. Will somebody try
<a href="https://www.npmjs.com/">npm</a>, <a href="https://packagist.org/">Packagist</a>, <a href="https://hackage.haskell.org/">Hackage</a>, <a href="https://www.cpan.org/">CPAN</a>,
<a href="https://elpa.gnu.org/">ELPA</a>, etc.? Let us know.</p>
<p>I would suggest starting with <a href="https://docs.pagure.org/copr.copr/user_documentation.html#mass-rebuilds">Copr Mass Rebuilds
documentation</a>.</p>