Maciej Koziński
Although a web cache server should improve access to World Wide Web resources, it is often blamed for slowing down the connection to the world. There are at least these reasons for that:
- some people and organizations dislike the web caching service, because it reduces the number of hits to their web pages - in a commercial environment that seems to lower their prestige ;)
- some users try to turn off the web cache and they get web pages fetched faster; they do not pay attention to the simple fact that the free bandwidth is available due to the other users using the web cache ;)
- some web caching installations are misconfigured and work slowly :(
This means that it is not enough to compile and install squid for web caching to work fine; you have to watch the service and improve it if necessary.
The general rule for tuning squid is to reduce the number of activities. You can simply live without caching some resources and without keeping some information in your logs. You must remember that:
- every little action costs something
- things done locally are (sometimes) less expensive than things spread across the network
You can reduce the workload for squid by properly planning the configuration for the clients. The clients should not go through the web cache while fetching local resources or other resources available via fast links. You will gain nothing from caching local resources, while the number of such requests could be relatively large and could slow down the transfer of other resources via squid. The example below shows how to configure Netscape via a proxy auto configuration file to avoid the behaviour described above:
function FindProxyForURL (url, host)
{
    // local resources are fetched directly, bypassing the cache
    if (shExpMatch(url, "*.your.local.domain/*"))
    {
        return "DIRECT";
    }
    else
    {
        // everything else goes through squid; DIRECT is the
        // fallback used when the proxy cannot be reached
        return "PROXY my.squid.server.com:8080; DIRECT";
    }
}
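For the auto configuration file to work, the web server must deliver it with the proper MIME type. Assuming an Apache server and the conventional .pac extension (both are assumptions of this example, not requirements of squid), one line in the Apache configuration is enough:
AddType application/x-ns-proxy-autoconfig .pac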
This will leave more RAM and hard disk space, more CPU time and more sockets for those connections that really require caching.
You can configure squid in several ways to stop caching objects which are available at low cost. First, you should use the always_direct directive in the squid.conf file. The argument should be as above: your local domain/network. Construct a proper access control list (ACL), e.g.:
acl local dst xxx.xxx.xxx.0/255.255.255.0
The one above is for a pseudo-C class network (an Ethernet segment with 254 effective IP addresses), or:
acl local dstdomain your.local.domain
The second one is less effective, because squid has to involve DNS for each request to check the reverse (IP -> fully qualified domain name) DNS record. In fact, there is one more catch: you have to have all your clients registered in both primary and reverse DNS - if you forget about that, an unregistered client will not be granted permission to use the web cache.
Having the ACL local defined, you should declare the next two rules in squid.conf:
no_cache deny local
always_direct allow local
The first one forces squid never to cache objects from the specified ACL and to immediately remove objects matching this ACL from the cache. The second one forces squid to forward such requests directly to the origin server, bypassing any cache hierarchy when fulfilling local requests.
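Putting the pieces together, the relevant squid.conf fragment could look like this minimal sketch (the network address is, as before, an example to adapt):
acl local dst xxx.xxx.xxx.0/255.255.255.0
no_cache deny local
always_direct allow local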
These simple configuration tricks save a lot of useful resources from being spent unnecessarily. These settings are especially important if you have many old clients which cannot be configured via the proxy auto configuration file shown above.
First, see http://squid.nlanr.net/Squid/FAQ/FAQ-11.html#ss11.17 to avoid slowing down your squid installation; the slowdown is caused by dedicating too much hard disk space in comparison to the RAM available to squid.
The general rule is to dedicate as much RAM as possible to squid. The reason is simple: retrieving objects kept in the RAM cache is much faster than retrieving them from disk. When serving many requests simultaneously, RAM is also used for buffering incoming and outgoing data, so a large installation requires a lot of RAM.
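The amount of RAM squid dedicates to keeping hot objects is controlled by the cache_mem directive in squid.conf; note that this is not squid's total memory usage, only the pool for in-memory objects. The value below is just an example to adjust to your hardware:
cache_mem 64 MB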
Given that the memory hit ratio is rather small (as I have measured, it is below 10%), the performance of squid strongly depends on the disk setup. Having the cache directories distributed among many disks improves the performance, as the NLANR research proved. It is also a good idea to have more than one disk controller, hard disks connected to different controllers, and cache directories spread among these disks. This could be the default for PCs having more than one disk, since they are now equipped with two (E)IDE controllers. On a PC with Linux you can also use the program called hdparm, which can improve the performance of your disks.
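For example, with two disks connected to different controllers, the spool could be split like this in squid.conf (the paths and sizes are assumptions; the numbers are the spool size in MB and the number of first and second level directories, and depending on your squid version the filesystem type field may be absent):
cache_dir ufs /cache1 2048 16 256
cache_dir ufs /cache2 2048 16 256
On Linux, hdparm can then switch (E)IDE disks into faster modes, e.g. enabling 32-bit I/O and DMA - test carefully, as wrong settings can hang the machine:
/sbin/hdparm -c1 -d1 /dev/hda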
Having squid compiled and configured with async I/O threads also seems to be a good idea. You will need the libthread.a library; fortunately, it is now available in almost all modern Uni*es. This can be achieved by configuring squid before compiling this way:
./configure --enable-async-io
You should carefully balance hit ratio against speed. Remember that a cache with a big spool directory usually has a very good hit ratio, but it is also usually slow. I suppose that even though squid does not need much CPU power in general, it requires a lot of power while processing a request and looking for the object in its database; as the spool grows, processing each request consumes precious time, which is clearly visible on slower machines. I have observed this in the Polish POL34 w3cache hierarchy: the fastest cooperating machine was the one after a disk crash, with an almost empty spool.
Due to the fact that squid requires a lot of power in very small periods of time, it is probably a good idea to give it the highest priority at startup. I have done that by altering the RunCache script:
/usr/bin/nice -n -20 squid -Y $conf >> $logdir/squid.out 2>&1
This way of calling nice works with GNU nice (Linux, FreeBSD?) and also on Solaris 2.x. Why the highest priority? The Web is almost interactive, and waiting for the next Web page to appear is probably one of the most annoying things about working with the Internet. To be comparable sometimes even with direct connections, squid should react to requests as soon as possible. Giving it the highest priority for the very short periods when it requires it can make it run faster. Another interesting concept was given to me by Piotr Auksztulewicz: he used plock() to stop the squid process and data from being swapped between RAM and the virtual memory pools.
It can be done by editing src/main.c. Add to the includes (plock() is declared in sys/lock.h):
#include <sys/lock.h>
and add this call at the beginning of the function main():
plock (PROCLOCK);   /* lock text and data segments in core */
This will keep the whole squid process in RAM all the time.
Logging
Logging is a useful thing for gathering statistics and debugging information, but it is sometimes painful, especially in workhorse installations. Look again at the call to the main process:
/usr/bin/nice -n -20 squid -Y $conf >> $logdir/squid.out 2>&1
I have cut out the -s parameter - logging via syslog is an unnecessary waste of time and resources when you don't use a separate log server. Everything you want to know about squid you can find in its own logfiles. It is usually also unnecessary to log ICP queries to access.log - I don't know exactly how much it slows down squid itself, but it slows down logfile processing for sure :)) Turn it off in your squid.conf this way:
log_icp_queries off
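If you don't post-process store.log either, you can disable it completely - a one-line sketch for squid.conf:
cache_store_log none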
You should also turn off or limit ident lookups for your clients if there is no real need for them. Squid has them turned off by default.
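If you do need ident for some clients, newer squid versions can limit the lookups with an ACL instead of doing them for everybody; a sketch with an assumed ACL name and network:
acl ident_hosts src xxx.xxx.xxx.0/255.255.255.0
ident_lookup_access allow ident_hosts
ident_lookup_access deny all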
IDENT server
Many web sites will check your ident before sending a document to you. That means that when you don't have an ident server on your machine, you either wait for a timeout before getting the Web page or spend additional resources on respawning ident copies.
My solution is to get the free pidentd by Peter Eriksson <pen@lysator.liu.se> and compile it with thread support (again, you will need the libthread.a library), then run it from the startup scripts, not from inetd.conf. This makes one copy of identd resident in RAM, processing all the requests without forking and the associated overhead. Get pidentd from ftp://ftp.lysator.liu.se/pub/ident/servers/.
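A minimal sketch of that change, assuming pidentd 2.x (the paths and the exact inetd.conf line vary between systems, so check your own):
# in /etc/inetd.conf - disable the inetd-spawned copy:
# auth stream tcp wait root /usr/sbin/in.identd in.identd
# in a startup script - run pidentd as a standalone daemon
# (-b is pidentd's stand-alone mode; verify it in your version's docs):
/usr/local/sbin/identd -b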
Squid now has its own caching nameserver, but it still needs to get DNS answers quickly during the initial connection.
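One way to get fast answers is to run a caching-only nameserver on the squid machine itself and point the resolver at it; a minimal sketch, assuming a local named is already running:
# /etc/resolv.conf on the squid host
nameserver 127.0.0.1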
If you are focused on speeding up the fulfilling of your web requests, use Cache Digests (CD). Looking up the content in memory is much, much faster than querying a remote host via the network and waiting for remote lookups and all the responses to decide where to fetch from. As I have measured ICP and CD with squeezer, CD responses in similar conditions (link, workload) are several times faster than ICP.
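Note that Cache Digests have to be enabled when squid is compiled; assuming a squid 2.x source tree, the relevant configure flag is:
./configure --enable-cache-digests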
This document is under constant construction and it is a compilation of many people's observations and ideas. If you have any question, flame, suggestion or idea on how to speed up web caching, or you want to discuss the issues described above, contact me.