
A look at package repository proxies


February 13, 2009

This article was contributed by Nathan Willis

For simplicity's sake, I keep all of my general-purpose boxes running the same Linux distribution. That minimizes conflicts when sharing applications and data, but every substantial upgrade means downloading the same packages multiple times — taking a toll on bandwidth. I used to use apt-proxy to intelligently cache downloaded packages for all the machines to share, but there are alternatives: apt-cacher, apt-cacher-ng, and approx, as well as options available for RPM-based distributions. This article will take a look at some of these tools.

The generic way

Since Apt and RPM use HTTP to move data, it is possible to speed up multiple updates simply by using a caching Web proxy like Squid. A transparent proxy sitting between your LAN clients and the Internet requires no changes to the client machines; otherwise you must configure Apt and RPM to use the proxy, just as you must configure your Web browser to redirect its requests. In each case, a simple change in the appropriate configuration file is all that is required: /etc/apt/apt.conf.d/70debconf or /etc/rpmrc, for example.
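
For illustration, a minimal sketch of those client-side settings, assuming a Squid instance at the hypothetical address 192.168.1.100 on Squid's default port 3128:

    // Apt clients: /etc/apt/apt.conf (or a snippet under /etc/apt/apt.conf.d/)
    Acquire::http::Proxy "http://192.168.1.100:3128/";

    # Yum clients: /etc/yum.conf, in the [main] section
    proxy=http://192.168.1.100:3128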

Although straightforward, this technique has its drawbacks. First, a Web proxy will not recognize that two copies of a package retrieved from different URLs are identical, undermining the process for RPM-based distributions like Fedora, where the Yum update tool incorporates built-in mirroring.

Second, using the same cache for packages and all other HTTP traffic risks overflowing it. Very large upgrades, such as moving to a new release rather than applying individual package updates, can fill the proxy's cache, and downloaded packages can be pushed out by ordinary web traffic if your LAN-wide upgrade takes too long. It is better to keep software updates and general web traffic separate.

Apt-proxy versus apt-cacher

The grand-daddy of the Apt caching proxies is apt-proxy. The current revision is written in Python and uses the Twisted framework. Complaints about apt-proxy's speed, memory usage, and stability spawned the creation of apt-cacher, a Perl-and-cURL based replacement that can run either as a stand-alone daemon or as a CGI script on a web server. Both operate by running as a service and accepting incoming Apt connections from client machines on a high-numbered TCP port: 9999 for apt-proxy, 3142 for apt-cacher.

Apt-proxy is configured in the file /etc/apt-proxy/apt-proxy-v2.conf. In this file, one sets up a section for each Apt repository that will be accessed by any of the machines using the proxy service. The syntax requires assigning a unique alias to each section along with listing one or more URLs for each repository. On each client machine, one must change the repository information in /etc/apt/sources.list, altering each line to point to the apt-proxy server and the appropriate section alias that was assigned in /etc/apt-proxy/apt-proxy-v2.conf.

For example, consider an apt-proxy server running on 192.168.1.100. If the original repository line in a client's sources.list is:

    deb http://archive.ubuntu.com/ubuntu/ intrepid main

it would instead need to read:

    deb http://192.168.1.100:9999/ubuntubackend intrepid main

The new URL points to the apt-proxy server on 192.168.1.100, port 9999, and to the section configured with the alias ubuntubackend. The apt-proxy-v2.conf file would contain an entry such as:

    [ubuntubackend]
    backends = http://archive.ubuntu.com/ubuntu/

If you find that syntax confusing, you are not alone. Apt-proxy requires detailed configuration on both the server and client sides: it forces you to invent aliases for all existing repositories, and to edit every repository line in every client's sources.list.

Apt-cacher is notably simpler to configure. Although its server configuration file, /etc/apt-cacher/apt-cacher.conf, offers a swath of options, the server does not need to know about all of the upstream Apt repositories that clients will access; configuring the clients is enough to establish a working proxy. On the client side there are two options: either rewrite the URLs of the repositories in each client's sources.list, or activate Apt's existing proxy support in /etc/apt/apt.conf. But choose one or the other; you cannot do both.

To rewrite entries in sources.list, one merely prepends the address of the apt-cacher server to the URL. So

    deb http://archive.ubuntu.com/ubuntu/ intrepid main

becomes:

    deb http://192.168.1.100:3142/archive.ubuntu.com/ubuntu/ intrepid main

Alternatively, leave the sources.list untouched, and edit apt.conf, inserting the line:

    Acquire::http::Proxy "http://192.168.1.100:3142/";

Ease of configuration aside, the two tools are approximately equal under basic LAN conditions. Apt-cacher does offer more options for advanced usage, including restricting access to specific hosts, logging, rate-limiting, and cache maintenance. Both tools allow importing existing packages from a local Apt cache into the cache shared by all machines.
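
For the import step, apt-cacher ships a helper script; paths vary by version and distribution, but on a Debian-derived system the procedure looks something like this (both paths are assumptions):

    # Copy already-downloaded packages into apt-cacher's import directory,
    # then let the script file them into the shared cache
    cp /var/cache/apt/archives/*.deb /var/cache/apt-cacher/import/
    /usr/share/apt-cacher/apt-cacher-import.pl /var/cache/apt-cacher/import/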

Much of the criticism of the tools observed on mailing lists or web forums revolves around failure modes, for example whether Twisted or cURL is more reliable as a network layer. But there are telling discussions from experienced users of both that highlight differences you would rather not experience firsthand.

For example, this discussion includes a description of how apt-proxy's simplistic cache maintenance can lose a cached package: if two clients download different versions of the same package, the earlier version expires from the cache because apt-proxy does not realize that keeping both is desirable. If you routinely test unstable packages on some but not all of your boxes, such a scenario could bite you.

Other tools for Apt

Although apt-proxy and apt-cacher get the most attention, they are not the only options.

Approx is intended as a replacement for apt-proxy, written in Objective Caml and placing an emphasis on simplicity. As with apt-proxy, client-side configuration involves rewriting the repository entries in sources.list. The server-side configuration is simpler, however: each repository is mapped to a single alias, with one entry per line.
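
A minimal sketch of both halves, reusing the hypothetical 192.168.1.100 server from above (approx listens on port 9999 by default):

    # /etc/approx/approx.conf on the server: one alias and upstream URL per line
    ubuntu    http://archive.ubuntu.com/ubuntu

    # The rewritten line in each client's sources.list
    deb http://192.168.1.100:9999/ubuntu intrepid main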

Apt-cacher-ng is designed to serve as a drop-in replacement for apt-cacher, with the added benefits of multi-threading and HTTP pipelining lending it better speed. The server runs on the same TCP port, 3142, so transitioning from apt-cacher to apt-cacher-ng requires no changes on the client side. The server-side configuration is different, in that the configuration can be split into multiple external files and incorporate complicated remapping rules.
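
As a taste of those remapping rules, the sample configuration shipped in /etc/apt-cacher-ng/acng.conf contains a line along these lines (the file names refer to lists bundled with apt-cacher-ng, and the exact syntax may differ between versions):

    # Merge requests for any known Debian mirror into the single cache
    # path /debian, fetching from the mirrors listed in backends_debian
    Remap-debrep: file:deb_mirror*.gz /debian ; file:backends_debian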

Apt-cacher-ng does not presently provide manpage documentation, supplying instead a 14-page PDF. Command-line fans may find that disconcerting. Neither application has supplanted the original utility it was designed to replace, but both are relatively recent projects. If apt-proxy or apt-cacher don't do the job for you, perhaps approx or apt-cacher-ng will.

Tools for RPM

The situation for RPM users is less rosy. Of course, as any packaging maven will tell you, RPM and Apt are not proper equivalents. Apt is the high-level tool for managing Debian packages with dpkg. A proper analog on RPM-based systems would be Yum. Unfortunately, the Yum universe does not yet have dedicated caching proxy packages like those prevalent for Apt. It is not because no one is interested; searching for the appropriate terms digs up threads at Linux Users' Group mailing lists, distribution web forums, and general purpose Linux help sites.

One can, of course, use Apt to manage an RPM-based system, but in most cases the RPM-based distributions assume that you will use some other tool designed for RPM from the ground up. In such a case, configuring Apt is likely to be a task left to the individual user, as opposed to a pre-configured Yum setup.

Most of the proposed workarounds for Yum involve some variation of the general-purpose HTTP proxy solution described above, using Squid or http-replicator. If you take this road, it is possible to avoid some of the pitfalls of lumping RPM and general web traffic into one cache by using the HTTP proxy only for package updates. Just make sure that plenty of space has been allocated for the cache.
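
Pointing only Yum at the proxy, with the proxy= line shown earlier, accomplishes that separation; on the Squid side, generous limits help. A sketch, with sizes chosen purely for illustration:

    # /etc/squid/squid.conf: a 20 GB on-disk cache that accepts
    # objects large enough to hold big packages
    cache_dir ufs /var/spool/squid 20000 16 256
    maximum_object_size 102400 KB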

Alternatively, one can set up a local mirror of the entire remote repository, either with a tool such as mrepo or piecemeal. The local repository can then serve all of the clients on the LAN. Note, however, that this method maintains a mirror of the entire remote repository, not just the packages that you actually download, and that you will have to update the machine hosting the mirror itself in the old-fashioned manner.
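
A piecemeal sketch of the idea, with the mirror URL and paths invented for the example:

    # On the mirror host: pull down the packages and regenerate metadata
    rsync -av rsync://mirror.example.com/fedora/updates/ /srv/mirror/updates/
    createrepo /srv/mirror/updates/

    # /etc/yum.repos.d/local-updates.repo on each client
    [local-updates]
    name=Local updates mirror
    baseurl=http://192.168.1.100/updates/
    enabled=1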

Finally, for the daring, one other interesting discussion proposes faking a caching proxy by configuring each machine to use the same Yum cache, shared via NFS. Caveat emptor.
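
For the record, a sketch of the moving parts; all paths are assumptions, and the locking caveats raised in the comments below apply:

    # /etc/exports on the machine hosting the shared cache
    /var/cache/yum 192.168.1.0/24(rw,sync,no_subtree_check)

    # /etc/fstab on each client
    192.168.1.100:/var/cache/yum  /var/cache/yum  nfs  defaults  0 0

    # /etc/yum.conf on every machine: retain downloaded packages
    [main]
    keepcache=1
    cachedir=/var/cache/yum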

I ultimately went with apt-cacher for this round of upgrades, on the basis of its simpler configuration and its widespread deployment elsewhere. Thus far I have no complaints; the initial update went smoothly (Ubuntu boxes moving from 8.04 to 8.10, for the curious). The machines are now all in sync; time will tell whether subsequent package updates reveal problems in the coming months. It's a good thing there are alternatives.



A look at package repository proxies

Posted Feb 13, 2009 22:48 UTC (Fri) by johill (subscriber, #25196) [Link]

I like apt-zeroconf; it still works if a machine is down and doesn't require any real setup thought.

A look at package repository proxies

Posted Feb 13, 2009 23:51 UTC (Fri) by sspans (guest, #43276) [Link]

Indeed, apt-zeroconf seems like the most hassle-free of all, especially for home setups where security doesn't really matter. In a server environment I ended up going for an mDNS repository (ubuntu.local) served by two hosts. This required an Avahi hack (adding a CNAME to myhostname.local), but it certainly works. Both servers had a complete Ubuntu repository. The fact that a full repository is only 200-something GB, and that the Dutch Ubuntu mirror was on the same switch, kinda helped.

A look at package repository proxies

Posted Feb 14, 2009 20:29 UTC (Sat) by johill (subscriber, #25196) [Link]

As far as I know apt-zeroconf doesn't really affect security at all -- you still download package lists from the server and then verify the package signatures anyway, so a rogue apt-zeroconf 'cache' wouldn't do any damage unless you were ignoring package signature failures.

But if the other version works well for you, by all means use it! I just have a bunch of machines that are mostly off, and I don't care about downloading more when the alternative would be to walk over to another house and switch on a computer ;)

Another repo cacher for deb and rpm distros

Posted Feb 13, 2009 22:56 UTC (Fri) by dowdle (subscriber, #659) [Link]

Something to add to your list is pkg-cacher, by Robert Nelson. pkg-cacher is a modified version of Debian's apt-cacher that works with both .rpm and .deb packages. He made it as a helper application for vzpkg2, something he also wrote as a replacement for OpenVZ's venerable vzpkg application.

vzpkg2 and pkg-cacher are used to build OpenVZ OS Templates by dragging packages from the distro repos into a local cache.

pkg-cacher has a number of unique features, especially since it can handle both rpm and deb repos, so anyone needing such a tool should check it out. It can run as a stand-alone service or as a CGI application.

http://gforge.opensource-sw.net/projects/pkg_cacher/

Another repo cacher for deb and rpm distros

Posted Feb 16, 2009 9:10 UTC (Mon) by cyperpunks (subscriber, #39406) [Link]

I have used pkg-cacher as a proxy for several months now. It works great. pkg-cacher is more or less apt-cacher for yum.

A look at package repository proxies

Posted Feb 13, 2009 23:19 UTC (Fri) by yokem_55 (subscriber, #10498) [Link]

For my Gentoo machines, my "proxy" is nothing more than exporting my distfiles and Portage tree via NFS from one master machine, which saves a lot of disk space as well as bandwidth. Is there something about Fedora that makes this kind of "proxy" impractical?

rpm/yum package repos via NFS

Posted Feb 13, 2009 23:28 UTC (Fri) by dowdle (subscriber, #659) [Link]

The /etc/yum.conf specifies what directory to look for packages in, and although I haven't done it myself, I'm sure NFS-mounting an updates repo dir on said directory would make each client machine happy. They'd still have to contact the repo servers for the metadata, but when they checked what needed to be downloaded, they'd find everything already sitting there.

Another way would be just to mount things over NFS somewhere and then use file:// references in the .repo definitions rather than http://. In that case the NFS mount would be used for both packages and repo metadata.
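
A sketch of such a .repo entry, assuming the repository is mounted at /mnt/repo:

    # /etc/yum.repos.d/nfs-updates.repo: metadata and packages both
    # come from the NFS mount
    [nfs-updates]
    name=Updates via NFS
    baseurl=file:///mnt/repo/updates/
    enabled=1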

rpm/yum package repos via NFS

Posted Feb 14, 2009 0:59 UTC (Sat) by JoeBuck (subscriber, #2330) [Link]

I don't think that yum's locking works correctly with a shared NFS mount for the package archive. Checking yum.pid to see if the process with the lock is still alive won't work right.

On the other hand, if yum commands are run in such a way that no two machines are running yum at the same time, things should be fine.

A look at package repository proxies

Posted Feb 14, 2009 7:38 UTC (Sat) by tzafrir (subscriber, #11501) [Link]

One obvious issue (at least for apt/dpkg) is that this will not produce a signed repository.

A look at package repository proxies

Posted Feb 14, 2009 19:07 UTC (Sat) by jwb (guest, #15467) [Link]

I don't understand. If I mount /var/cache/apt/archives from a remote system, I see no problems with the package signatures.

A look at package repository proxies

Posted Feb 14, 2009 22:15 UTC (Sat) by drag (guest, #31333) [Link]

Ya...

With Debian I believe the package list is signed, and the package list contains checksums of all the packages. So as long as the checksums match the packages, it should not matter.

-------------------

With Debian I just used approx. A caching proxy seems the obvious way to go, and it does not involve setting up any network shares or anything like that.

I frequently do temporary installs and VMs on various pieces of hardware for various reasons. When doing a network install, having the ability to simply direct the installer to use http://system.name:9999/etc is a HUGE time saver. On my work's corporate network everything goes through a proxy which is either somewhat broken or gives very low priority to large files being downloaded, so it can take an hour or two to download a single 30 MB package or whatnot, depending on how busy the network is. Having a nice and easy-to-use proxy that doesn't require anything special is a big deal for me.

This is one of the things I really miss when using Fedora.

NFS as a cache

Posted Feb 18, 2009 6:16 UTC (Wed) by pjm (guest, #2080) [Link]

One issue is handling the case where multiple machines try to install something at the same time: ideally you'd allow multiple machines to upgrade simultaneously but not download the same file twice. I believe none of apt/yum/... does the per-file locking in the NFS-shared directory that this would require, whereas most of the other suggestions here do have the desired property. (See also other people's comments on locking in this thread.)

Deletion is another issue: if some machines are configured to use bleeding-edge versions of things while others take the "better the bugs you know about" approach, then they'll have different ideas of when it's OK to delete a package from the cache. For that matter, apt will by default delete package lists that aren't referenced by its sources.list configuration file, which would be bad if different machines have different sources.list contents; you'd want to add an APT::Get::List-Cleanup configuration entry on all your client machines to prevent this, and then remove package-list files manually.
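
For reference, the apt.conf entry in question would be something like:

    // /etc/apt/apt.conf on each client: keep package lists that other
    // machines' sources.list files still reference
    APT::Get::List-Cleanup "false";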

A very minor issue is that a per-machine cache is occasionally useful when the network is down (for the same reasons that apt/yum/... keep a local cache at all); though conversely there are some benefits (du, administration) in avoiding multiple caches.

I'd expect NFS to be slightly less efficient than the alternatives, but this shouldn't be noticeable.

A look at package repository proxies

Posted Feb 14, 2009 1:35 UTC (Sat) by pabs (subscriber, #43278) [Link]

Anyone know if any of these support debdelta or the RPM equivalent of that?

A look at package repository proxies

Posted Feb 14, 2009 4:03 UTC (Sat) by rahulsundaram (subscriber, #21946) [Link]

Not sure any of them support DeltaRPM but refer

http://fedoraproject.org/wiki/Releases/FeaturePresto

Also consider looking at Spacewalk; DeltaRPM support is planned to be added soon.
https://fedorahosted.org/spacewalk/wiki/DeltaRpmSupport

A look at package repository proxies

Posted Feb 14, 2009 2:49 UTC (Sat) by smithj (guest, #38034) [Link]

Spacewalk, the project name for the recently-open-sourced RHN Satellite, includes a proxy. http://www.redhat.com/spacewalk/

I don't think it is considered "stable" yet, though.

A look at package repository proxies

Posted Feb 16, 2009 9:37 UTC (Mon) by hppnq (guest, #14462) [Link]

Minor nitpick: Spacewalk is the upstream project for the RHN Satellite product.

A look at package repository proxies

Posted Feb 14, 2009 15:07 UTC (Sat) by nlucas (subscriber, #33793) [Link]

Thanks for the article.
It was one of those things that I wanted to investigate for a long time but never did, for one reason or another.
Now all my home PCs use apt-cacher, using xinetd on the server (I liked the fact that you only need to add an apt.conf line to the clients and can leave sources.list alone).

A look at package repository proxies

Posted Feb 15, 2009 5:13 UTC (Sun) by pabs (subscriber, #43278) [Link]

Did Fedora send those bsdiff speed improvements upstream?

Good to see Fedora has binary-diff updates as well as Debian.

A look at package repository proxies

Posted Feb 15, 2009 9:13 UTC (Sun) by job (guest, #670) [Link]

Given a large network with the same architecture, isn't the obvious way to run apt against an internal mirror?

In a more ad-hoc network you could just share /var/cache/apt over NFS, and let apt handle the caching. What are the drawbacks of that compared to running a caching proxy?

A look at package repository proxies

Posted Feb 16, 2009 0:28 UTC (Mon) by maney (subscriber, #12630) [Link]

Well, it's still kind of wasteful. At least in my experience I never install more than a modest fraction of all the packages, but if you have enough same-arch machines it could still be a win. I'd have to keep copies of three releases here (32- and 64-bit x86 Ubuntu and x86 Debian, not to mention machines on different releases a fair part of the time); nah, it would be silly... for me.

Sharing apt's cache over NFS might run into concurrency issues, but if you're the only one who administers any of the boxes, okay, that probably can work. (I remember setting up NFS for no reason but to be able to run the Slackware install without having to have a great pile of floppies...) One thing I don't believe that approach addresses at all is purging obsolete versions of packages as security and bug fixes roll in.

SQUID!

Posted Feb 15, 2009 19:16 UTC (Sun) by leonov (guest, #6295) [Link]

I use a Squid proxy on my gateway and just make sure all my boxes are set to use the same (local) mirror. As I upgrade manually, there are no race conditions to worry about, and it works very well, assuming fairly relaxed size limits on the Squid cache...

SQUID!

Posted Feb 20, 2009 9:36 UTC (Fri) by NightMonkey (subscriber, #23051) [Link]

A solution for Gentoo I've used for years is "http-replicator". It hums along quietly, with nary a hiccup (except when I haven't cleaned the cache out in a while...). http://sourceforge.net/projects/http-replicator

RPM: IntelligentMirror

Posted Feb 15, 2009 22:26 UTC (Sun) by mdomsch (guest, #5920) [Link]

IntelligentMirror, found at
https://fedorahosted.org/intelligentmirror/wiki/Intellige...,
aims to solve the problem of caching RPMs that may be identical yet come from different servers and thus have different URLs.

RPM: IntelligentMirror

Posted Feb 16, 2009 2:44 UTC (Mon) by skvidal (guest, #3094) [Link]

I'd also like to mention that IntelligentMirror is a dedicated package-caching solution for yum. It was developed for yum in GSoC.

RPM: IntelligentMirror

Posted Dec 24, 2010 13:43 UTC (Fri) by lbt (subscriber, #29672) [Link]

Thanks for the article :)

For those that google brings here:

FYI: Later versions of IntelligentMirror are just a trivial 2-3 line regexp substitution, with about 30k of cut'n'paste code and quite a few Python libraries (including yum) as dependencies that it uses to parse a 4-line config file.

This is actually the meat of it:
http://www.squid-cache.org/Doc/config/storeurl_rewrite_pr...

For the record I found my squid2.7 config needed:
798c1798
< # cache_replacement_policy lru
---
> cache_replacement_policy heap LFUDA
1988c1988
< # maximum_object_size 20480 KB
---
> maximum_object_size 80480 KB
2749a2751
> refresh_pattern -i \.(deb|rpm|zip|tar|gz|bz2)$ 259200 90% 259200 override-expire ignore-no-cache ignore-private reload-into-ims ignore-reload
2784c2786
< # quick_abort_min 16 KB
---
> quick_abort_min -1 KB
4948a4951,4958
> #### BEGIN Add to squid.conf ####
> storeurl_rewrite_program /usr/bin/python /etc/squid/intelligentmirror/intelligentmirror.py
> storeurl_rewrite_children 3
> acl store_rewrite_list urlpath_regex -i .rpm$
> acl store_rewrite_list urlpath_regex -i .deb$
> storeurl_access allow store_rewrite_list
> storeurl_access deny all
> #### END Add to squid.conf ####

A look at package repository proxies

Posted Feb 16, 2009 10:08 UTC (Mon) by cglass (guest, #52152) [Link]

No mention of apt-mirror?
Is it so obscure that I'm the only one using it or does it have any significant disadvantage I'm not aware of (disk space is not an issue for me)?

A look at package repository proxies

Posted Feb 19, 2009 17:57 UTC (Thu) by kov (guest, #7423) [Link]

The article says:

"First, a Web proxy will not recognize that two copies of a package retrieved from different URLs are
identical, undermining the process for RPM-based distributions like Fedora, where the Yum update
tool incorporates built-in mirroring."

This is not really true. Squid can be set up to normalize URLs, so that the same package will be found when two different URLs are used. I used to maintain a transparent proxy using that trick to handle my colleagues using different Debian mirrors.

A look at package repository proxies

Posted Feb 22, 2009 22:34 UTC (Sun) by mdz@debian.org (guest, #14112) [Link]

Surely you mean /etc/apt/apt.conf rather than /etc/apt/apt.conf.d/70debconf?

A look at package repository proxies

Posted Feb 24, 2009 21:32 UTC (Tue) by dlang (guest, #313) [Link]

That depends on the version you are running. I recently ran into the apt.conf.d directory on newer Debian systems; apparently they are trying to fragment the config into many separate files instead of having it in one file (my guess is that they expect a library of options to just copy into the directory).

A look at package repository proxies

Posted Mar 30, 2009 8:11 UTC (Mon) by mdz@debian.org (guest, #14112) [Link]

/etc/apt/apt.conf.d allows packages to supply apt configuration data without messy editing of config files. It's a common pattern in Debian.

In general, administrators should (continue to) use /etc/apt/apt.conf to provide their own settings. You should only edit 70debconf if you need to change the settings which are provided by debconf.

A look at package repository proxies

Posted Feb 26, 2009 19:07 UTC (Thu) by ruzam (guest, #56872) [Link]

Repository Proxy

http://freshmeat.net/projects/repo-proxy/

I've been saving my bandwidth (and keeping all the house computers up to date) with this for over a year now.


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds