				UntzUntz LAN Scan
				-----------------
				   Version 1.3c
Contained In This File:

	1. Purpose
	2. Why
	3. Requirements
	4. Installation
	5. Usage
	6. Problems & Questions
	7. Current Features and Information


Purpose:
-------
	This program uses 'smbclient' to create a listing of all shared files
on the local area network.  The shares can be on Microsoft Windows machines
or on other Samba-enabled machines.
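
	To get a feel for how it works: the crawler shells out to smbclient
for its listings, roughly like this (a sketch - the exact flags the
program passes may differ):

		smbclient -L somehost -N -U guest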


Why:
---
	I wrote this because I set up a network where most people (about 35)
	shared a lot of files with each other.  Finding a song or file that
	lives on maybe one machine, somewhere in 100,000 files, is a pain.
	So now you can search!


Requirements:
------------
	The main requirement is the 'smbclient' program.  If you do not have
	Samba installed, go to http://www.samba.org to download it.

	You'll also need a running web server to use the CGI program.
	Perhaps in the future another medium for searching could be developed.


Installation:
------------
	Compiling:

		g++ -O untzcgi-X.X.cpp -o untzcgi
		g++ -O untzlanscan-X.X.cpp -o untzlanscan

		(Replace X.X with the version of the source you have,
		e.g. 1.3c)


	Installation:

		Copy the 'untzcgi' program into your web server's CGI directory
			cp untzcgi /usr/local/apache/cgi-bin/			

		Copy the 'untzlanscan' program to /usr/local/bin
			cp untzlanscan /usr/local/bin

		Copy the 'untzls.conf' file to /etc/ directory
			cp untzls.conf /etc/

		You probably want to put this into your crontab so it will scan
		the network every once in a while.  Here is what I use:

		59 * * * * /usr/local/bin/untzlanscan

		This will run the scan every hour.
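
		One way to install the entry (a sketch - this appends to the
		current user's crontab, so run it as a user that can reach the
		shares, e.g. root):

			(crontab -l 2>/dev/null; echo "59 * * * * /usr/local/bin/untzlanscan") | crontab -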


	Configuration:

		Edit the file, /etc/untzls.conf to reflect your configuration.

		Here's the skinny on the configuration options:

			ip address
				This option is the IP address of the master browser
				on the network.  Any computer will probably work, but
				the master browser has the best chance of having all
				the computers listed.
	
			temp
				This is probably best set as /tmp
				Basically it is a place where the program creates some
				temporary files and holds the database.

			smbclient
				This is the path to the 'smbclient' program.
				If you do not have this program you will need to
				download the Samba package from http://www.samba.org

			logo jpeg
				This is the URL to a small image file shown in the
				upper left hand corner of the search screen.  Keep in
				mind there is no size constraint in the IMG tag - so
				it might look kind of weird if you put a 640x480 jpeg
				in there.

			username
				This is the name of the user smbclient will use to
				connect to systems.  If you have lax security the
				user 'guest' is probably your best bet.

			password
				OPTIONAL. If you are connecting to systems that require
				a user and a password, set the password option.
				One thing to keep in mind is that this password is held in
				clear text.  You may want to consider creating a user
				that only has READ and LIST options on the systems that
				this program will be connecting to.
				Default: None
							
			workgroup
				OPTIONAL. If given, this value is passed to smbclient
				as the workgroup option.
				Default: None

			results per page
				OPTIONAL. This option specifies the number of results
				to print per page.  If it is not given here and not
				passed to the CGI script, it will default to 15.
				Default: 15

			allow user robots
				OPTIONAL. A value of yes or no (or YES/NO).
				This specifies whether you want to give users the
				ability to block certain directories.
				Default: Yes, the user can specify his/her own robots.txt

			server robots
				OPTIONAL. This points to a file which specifies a list
				of machines, shares, and directories to block.  See the
				'Usage' section for the format of the file.
				Default: None

			page header
				OPTIONAL. This points to a file that defines a header
				for each search results page.  For more information on
				the variables that can be used within this file, see
				the 'Templates' section below under 'Usage'.

			page footer
				OPTIONAL. This points to a file that defines a footer
				for each search results page.  For more info on the
				variables, see below.

			search template
				OPTIONAL. This points to a file that defines a structure
				for each result.  See below for more details.

			force file
				OPTIONAL. This points to a file that tells the crawler
				to scan the shares it contains.  This can be used if
				for some reason smbclient cannot see a share, or if the
				share name is being cut off by smbclient.  The format
				of the file is simply one    //COMPUTER/SHARE    per
				line.
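
		Putting it all together, an /etc/untzls.conf might look
		something like this (a sketch - the values are only examples,
		and the 'option = value' layout assumes the same form as the
		'page header = ...' lines described under 'Templates' below):

			ip address        = 192.168.1.1
			temp              = /tmp
			smbclient         = /usr/local/samba/bin/smbclient
			logo jpeg         = http://myserver/images/logo.jpg
			username          = guest
			workgroup         = WORKGROUP
			results per page  = 15
			allow user robots = yes
			server robots     = /etc/untzls.robots
			page header       = /etc/untzls.header
			page footer       = /etc/untzls.footer
			search template   = /etc/untzls.search
			force file        = /etc/untzls.force

		And a force file is just one //COMPUTER/SHARE per line, for
		example:

			//PASSOUT/MUSIC
			//MONKEY/DATA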


Usage:
-----
	Search Form
	-----------
		Creating a main search web page is easy!
		Simply use the following code in your webpage:
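		(A sketch - this assumes untzcgi was installed at
		/cgi-bin/untzcgi as in the Installation section; the form
		method and field sizes are up to you, and the only field the
		CGI needs is QUERY.)

			<FORM METHOD=GET ACTION="/cgi-bin/untzcgi">
				<INPUT TYPE=TEXT NAME=QUERY SIZE=30>
				<INPUT TYPE=SUBMIT VALUE="Search">
			</FORM>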

			
		As you can see, you only need to pass the QUERY variable to
		begin a search.

	RESULTS (cgi option)
	--------------------
		If this is specified and is passed from the form, it tells the
		CGI how many results to print per page.  It goes hand-in-hand
		with the configuration option 'results per page'.  If it is
		not given here and not specified in the configuration file,
		the default is 15.

		If you want to embed this in your form, simply add a hidden
		input such as

			<INPUT TYPE=HIDDEN NAME=RESULTS VALUE=25>

		to your HTML within the FORM tag.

	server robots
	-------------
		If this directive is specified in the configuration file, the
		crawler will look through the file specified for machines,
		shares, and directories to not search.

		You ask, why would you want to do such a thing?  Well, if you
		are, for example, sharing a directory tree which has all the
		install files to a program, you might want to block searching
		of the actual install tree because there are a lot of
		ambiguous files that would have no use in a casual search.
		Blocking will also keep the size of the database small and
		therefore allow for faster lookups of worthwhile data.

		Each line is checked until a match is made.  At the end there
		is an implicit 'allow all'.

		The format of each line is:

			[action] [type] [data]

		action
		------
			This is what action to take; the options are:

				allow
				deny

		type
		----
			This is where to implement the filter; the options are:

				machine
				share
				directory
				all

			* When using 'all' it will block/allow on
			  machine/share/directory

		data
		----
			This is the actual thing to deny/allow.  It can use * as
			a wildcard.  Keep in mind that when the crawler checks
			this, it checks against the full path information.  So,
			for example, if you wanted to hide a directory called
			'Monkey' and it was in a directory called 'Data', you
			would want to use the following:

				deny directory \Data\Monkey*

		Examples
		--------
			The following example is to deny certain areas:

				deny machine passout
				deny share top*
				deny share *secret
				deny directory *help*

			The first line, 'deny machine passout', tells the crawler
			to not even look at the machine named 'passout'.

			The second line, 'deny share top*', tells the crawler not
			to crawl any share that begins with 'top'.

			The third line, 'deny share *secret', tells the crawler
			not to crawl any share that ends with 'secret'.

			The fourth line, 'deny directory *help*', tells the
			crawler not to crawl any directory with the word 'help' in
			it.  This also blocks all the subdirectories of that
			directory.

			You can also use the reverse logic to tell the program to
			crawl only certain machines, shares, or directories, for
			example:

				allow machine passout
				allow machine monkey
				deny all

			tells the crawler to search only the machines named
			'passout' and 'monkey'.

	robots.txt (user side)
	----------------------
		If you want users to have the ability to block a share from
		being searched, simply have them place a file called
		'robots.txt' in the share's root directory.  The crawler will
		see this and skip the entire share.

		If they want to block just a directory, have them place the
		'robots.txt' file in that directory and it will skip that
		directory.  It will not skip the subdirectories at this time.

	Templates
	---------
		So you want to make your UntzUntz LAN Scan search fit into the
		rest of your website?  Well then templates are the answer.
		Included are three example templates which are very similar to
		the default UntzUntz LAN Scan search results.

		One thing to mention is that templates are optional.  If you
		do not define template files, UntzUntz LAN Scan will use the
		defaults built into the code.

		Let's get started.
		First, let's examine the configuration options:

			page header =
				This can be a filename (with path) or it can be the
				word 'none'.  If it is set to 'none', then UntzUntz
				LAN Scan will not print anything for the header.
				Setting it to a file that doesn't exist is equivalent;
				in that case UntzUntz LAN Scan will also print nothing
				for the header.

			page footer =
				Same as 'page header' above.

			search template =
				This can only be a filename.  If it is defined but the
				file cannot be found, UntzUntz LAN Scan will use the
				default template.

		Ok...now we've configured our /etc/untzls.conf file...let's
		make the templates.

		Each file has a different set of variables that can be used.
		Below is a table of the variables and the files they work in.

		+----------+----------------+--------------------------------+
		| File     | Variable       | Description                    |
		+----------+----------------+--------------------------------+
		| Header   | $SMALL_LOGO    | This is the small logo defined |
		|          |                | in /etc/untzls.conf            |
		|          +----------------+--------------------------------+
		|          | $QUERY_DATA    | The text of the last query     |
		|          +----------------+--------------------------------+
		|          | $SEARCH_RESULTS| Listing of each keyword and the|
		|          |                | number of 'hits' for each      |
		|          +----------------+--------------------------------+
		|          | $FIND_SIZE     | This is the size (in MB/GB/TB) |
		|          |                | of the files found             |
		|          +----------------+--------------------------------+
		|          | $TOTAL_SIZE    | This is the total size of all  |
		|          |                | files on the network           |
		|          +----------------+--------------------------------+
		|          | $LOW_RESULTS   | This is the lowest result on   |
		|          |                | the page                       |
		|          +----------------+--------------------------------+
		|          | $HIGH_RESULTS  | This is the highest result on  |
		|          |                | the page                       |
		|          +----------------+--------------------------------+
		|          | $TOTAL_RESULTS | Total number of results        |
		|          +----------------+--------------------------------+
		|          | $SEARCH_TIME   | Time in seconds it took to find|
		|          +----------------+--------------------------------+
		|          | $PAGING_BAR    | The list of result pages       |
		+----------+----------------+--------------------------------+
		| Footer   | $PAGING_BAR    | The list of result pages       |
		|          +----------------+--------------------------------+
		|          | $UNTZ_FOOTER   | A small ad for UntzUntz LS     |
		+----------+----------------+--------------------------------+
		| Search   | $NUMBER        | Result number                  |
		| Template +----------------+--------------------------------+
		|          | $FILE_PATH     | Computer\Share\Dir\file        |
		|          +----------------+--------------------------------+
		|          | $FILE_NAME     | Name of file found             |
		|          +----------------+--------------------------------+
		|          | $FILE_SIZE     | Size of file (in KB/MB/GB)     |
		|          +----------------+--------------------------------+
		|          | $FILE_DATE     | Date of file                   |
		|          +----------------+--------------------------------+
		|          | $FILE_TIME     | Time of file                   |
		|          +----------------+--------------------------------+
		|          | $FILE_LOCATION | Computer\Share\Dir             |
		+----------+----------------+--------------------------------+

		Notes:
			1. $QUERY_DATA already has the "" around it

		Any HTML will work around these variables.  Only one of each
		variable will be expanded per line.  Meaning you can have:

			$NUMBER. $FILE_NAME

		But not:

			$NUMBER, $NUMBER, $NUMBER

		And why would you want to?  But if you really did want to, you
		could put each one on its own line:

			$NUMBER,
			$NUMBER,
			$NUMBER

		which would be the same.

		Look at the example template files for a better idea of how
		they work.
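
		For instance, a minimal search template file might look like
		this (a sketch, not one of the shipped examples - any HTML
		around the variables will do):

			<P>$NUMBER. <B>$FILE_NAME</B> ($FILE_SIZE)<BR>
			$FILE_LOCATION - $FILE_DATE $FILE_TIME</P>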

Problems or Questions?
---------------------
	One thing you can start with is changing the DEBUG_LEVEL variable
	in the source.  Change the 0 to a number between 1 and 9; 1 will
	give the least amount of information, 9 the most.

	Otherwise, email me at jed204@users.sourceforge.net


Current Features and Information:
--------------------------------
	Currently the program is configured through a configuration file.
	Within the configuration file is information about the master
	browser (or a computer on the network), a temporary directory, a
	logo file for the search page, and a network user name to search
	as (probably 'guest' for most networks).

	The crawler currently scans the network for clients, creates a
	share listing, and from the share listing creates a file listing.
	There is support for 'robots.txt': if this file is in the root
	directory of a share, the share will be skipped; if it is in a
	lower-level directory of a share, that directory will be skipped.
	Currently the crawler will skip hidden and printer shares as well.

	Once the crawler has found all the files on the network, the CGI
	program can search against them.  The CGI takes less than 1 second
	to search through the 30,000 files in 2,500 directories on my
	network.  I created a database with 350,000 files and 30,000
	directories and it took about 5 to 6 seconds.  These benchmarks
	are from a Pentium III 450 MHz with 512 MB of RAM.

	Also see the beginning of each source file for more information
	about development and releases.

	If you have any problems feel free to contact me at
	jed204@users.sourceforge.net

Thanks,
John
jed204@users.sourceforge.net