				UntzUntz LAN Scan
				-----------------
				   Version 1.3c
Contained In This File:

	1. Purpose
	2. Why
	3. Requirements
	4. Installation
	5. Usage
	6. Problems & Questions
	7. Current Features and Information


Purpose:
-------
	This program uses 'smbclient' to create a listing of all shared files
on the local area network.  The shares can be on Microsoft Windows machines
or on other Samba-enabled machines.
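
	To get a feel for how it works: the crawler shells out to smbclient
for its listings, roughly like this (a sketch - the exact flags the
program passes may differ):

		smbclient -L somehost -N -U guest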


Why:
---
	I wrote this because I set up a network where most people (about 35)
	shared a lot of files with each other.  Finding a song or file that
	lives on maybe one machine, somewhere in 100,000 files, is a pain.
	So now you can search!


Requirements:
------------
	The main requirement is the 'smbclient' program.  If you do not have
	Samba installed, go to http://www.samba.org to download it.

	You'll also need a running web server to use the CGI program.
	Perhaps in the future another medium for searching could be developed.


Installation:
------------
	Compiling:

		g++ -O untzcgi-X.X.cpp -o untzcgi
		g++ -O untzlanscan-X.X.cpp -o untzlanscan

		(Replace X.X with the version of the source you have,
		e.g. 1.3c)


	Installation:

		Copy the 'untzcgi' program into your web server's CGI directory
			cp untzcgi /usr/local/apache/cgi-bin/			

		Copy the 'untzlanscan' program to /usr/local/bin
			cp untzlanscan /usr/local/bin

		Copy the 'untzls.conf' file to /etc/ directory
			cp untzls.conf /etc/

		You probably want to put this into your crontab so it will scan
		the network every once in a while.  Here is what I use:

		59 * * * * /usr/local/bin/untzlanscan

		This will run the scan every hour.
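
		One way to install the entry (a sketch - this appends to the
		current user's crontab, so run it as a user that can reach the
		shares, e.g. root):

			(crontab -l 2>/dev/null; echo "59 * * * * /usr/local/bin/untzlanscan") | crontab -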


	Configuration:

		Edit the file, /etc/untzls.conf to reflect your configuration.

		Here's the skinny on the configuration options:

			ip address
				This option is the IP address of the master browser
				on the network.  Any computer will probably work, but
				the master browser has the best chance of having all
				the computers listed.
	
			temp
				This is probably best set as /tmp
				Basically it is a place where the program creates some
				temporary files and holds the database.

			smbclient
				This is the path to the 'smbclient' program.
				If you do not have this program you will need to
				download the Samba package from http://www.samba.org

			logo jpeg
				This is the URL to a small image file shown in the
				upper left hand corner of the search screen.  Keep in
				mind there is no size constraint in the IMG tag - so
				it might look kind of weird if you put a 640x480 jpeg
				in there.

			username
				This is the name of the user smbclient will use to
				connect to systems.  If you have lax security the
				user 'guest' is probably your best bet.

			password
				OPTIONAL. If you are connecting to systems that require
				a user and a password, set the password option.
				One thing to keep in mind is that this password is held in
				clear text.  You may want to consider creating a user
				that only has READ and LIST options on the systems that
				this program will be connecting to.
				Default: None
							
			workgroup
				OPTIONAL. If given, this value is passed to smbclient
				as the workgroup option.
				Default: None

			results per page
				OPTIONAL. This option specifies the number of results
				to print per page.  If it is not given here and not
				passed to the CGI script, it will default to 15.
				Default: 15

			allow user robots
				OPTIONAL. A value of yes or no (or YES/NO).
				This specifies whether you want to give users the
				ability to block certain directories.
				Default: Yes, the user can specify his/her own robots.txt

			server robots
				OPTIONAL. This points to a file which specifies a list
				of machines, shares, and directories to block.  See the
				'Usage' section for the format of the file.
				Default: None

			page header
				OPTIONAL. This points to a file that defines a header
				for each search results page.  For more information on
				the variables that can be used within this file, see
				the 'Templates' section below under 'Usage'.

			page footer
				OPTIONAL. This points to a file that defines a footer
				for each search results page.  For more info on the
				variables, see below.

			search template
				OPTIONAL. This points to a file that defines a structure
				for each result.  See below for more details.

			force file
				OPTIONAL. This points to a file that tells the crawler
				to scan the shares it contains.  This can be used if
				for some reason smbclient cannot see a share, or if the
				share name is being cut off by smbclient.  The format
				of the file is simply one    //COMPUTER/SHARE    per
				line.
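
		Putting it all together, an /etc/untzls.conf might look
		something like this (a sketch - the values are only examples,
		and the 'option = value' layout assumes the same form as the
		'page header = ...' lines described under 'Templates' below):

			ip address        = 192.168.1.1
			temp              = /tmp
			smbclient         = /usr/local/samba/bin/smbclient
			logo jpeg         = http://myserver/images/logo.jpg
			username          = guest
			workgroup         = WORKGROUP
			results per page  = 15
			allow user robots = yes
			server robots     = /etc/untzls.robots
			page header       = /etc/untzls.header
			page footer       = /etc/untzls.footer
			search template   = /etc/untzls.search
			force file        = /etc/untzls.force

		And a force file is just one //COMPUTER/SHARE per line, for
		example:

			//PASSOUT/MUSIC
			//MONKEY/DATA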


Usage:
-----
	Search Form
	-----------
		Creating a main search web page is easy!
		Simply use the following code in your webpage:
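		(A sketch - this assumes untzcgi was installed at
		/cgi-bin/untzcgi as in the Installation section; the form
		method and field sizes are up to you, and the only field the
		CGI needs is QUERY.)

			<FORM METHOD=GET ACTION="/cgi-bin/untzcgi">
				<INPUT TYPE=TEXT NAME=QUERY SIZE=30>
				<INPUT TYPE=SUBMIT VALUE="Search">
			</FORM>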

			
		As you can see, you only need to pass the QUERY variable to
		begin a search.

	RESULTS (cgi option)
	--------------------
		If this is specified and is passed from the form, it tells the
		CGI how many results to print per page.  It goes hand-in-hand
		with the configuration option 'results per page'.  If it is
		not given here and not specified in the configuration file,
		the default is 15.

		If you want to embed this in your form, simply add a hidden
		input such as

			<INPUT TYPE=HIDDEN NAME=RESULTS VALUE=25>

		to your HTML within the FORM tag.

	server robots
	-------------
		If this directive is specified in the configuration file, the
		crawler will look through the file specified for machines,
		shares, and directories to not search.

		You ask, why would you want to do such a thing?  Well, if you
		are, for example, sharing a directory tree which has all the
		install files to a program, you might want to block searching
		of the actual install tree because there are a lot of
		ambiguous files that would have no use in a casual search.
		Blocking will also keep the size of the database small and
		therefore allow for faster lookups of worthwhile data.

		Each line is checked until a match is made.  At the end there
		is an implicit 'allow all'.

		The format of each line is:

			[action] [type] [data]

		action
		------
			This is what action to take; the options are:

				allow
				deny

		type
		----
			This is where to implement the filter; the options are:

				machine
				share
				directory
				all

			* When using 'all' it will block/allow on
			  machine/share/directory

		data
		----
			This is the actual thing to deny/allow.  It can use * as
			a wildcard.  Keep in mind that when the crawler checks
			this, it checks against the full path information.  So,
			for example, if you wanted to hide a directory called
			'Monkey' and it was in a directory called 'Data', you
			would want to use the following:

				deny directory \Data\Monkey*

		Examples
		--------
			The following example is to deny certain areas:

				deny machine passout
				deny share top*
				deny share *secret
				deny directory *help*

			The first line, 'deny machine passout', tells the crawler
			to not even look at the machine named 'passout'.

			The second line, 'deny share top*', tells the crawler not
			to crawl any share that begins with 'top'.

			The third line, 'deny share *secret', tells the crawler
			not to crawl any share that ends with 'secret'.

			The fourth line, 'deny directory *help*', tells the
			crawler not to crawl any directory with the word 'help' in
			it.  This also blocks all the subdirectories of that
			directory.

			You can also use the reverse logic to tell the program to
			crawl only certain machines, shares, or directories, for
			example:

				allow machine passout
				allow machine monkey
				deny all

			tells the crawler to search only the machines named
			'passout' and 'monkey'.

	robots.txt (user side)
	----------------------
		If you want users to have the ability to block a share from
		being searched, simply have them place a file called
		'robots.txt' in the share's root directory.  The crawler will
		see this and skip the entire share.

		If they want to block just a directory, have them place the
		'robots.txt' file in that directory and it will skip that
		directory.  It will not skip the subdirectories at this time.

	Templates
	---------
		So you want to make your UntzUntz LAN Scan search fit into the
		rest of your website?  Well then templates are the answer.
		Included are three example templates which are very similar to
		the default UntzUntz LAN Scan search results.

		One thing to mention is that templates are optional.  If you
		do not define template files, UntzUntz LAN Scan will use the
		defaults built into the code.

		Let's get started.
		First, let's examine the configuration options:

			page header =
				This can be a filename (with path) or it can be the
				word 'none'.  If it is set to 'none', then UntzUntz
				LAN Scan will not print anything for the header.
				Setting it to a file that doesn't exist is equivalent;
				in that case UntzUntz LAN Scan will also print nothing
				for the header.

			page footer =
				Same as 'page header' above.

			search template =
				This can only be a filename.  If it is defined but the
				file cannot be found, UntzUntz LAN Scan will use the
				default template.

		Ok...now we've configured our /etc/untzls.conf file...let's
		make the templates.

		Each file has a different set of variables that can be used.
		Below is a table of the variables and the files they work in.

		+----------+----------------+--------------------------------+
		| File     | Variable       | Description                    |
		+----------+----------------+--------------------------------+
		| Header   | $SMALL_LOGO    | This is the small logo defined |
		|          |                | in /etc/untzls.conf            |
		|          +----------------+--------------------------------+
		|          | $QUERY_DATA    | The text of the last query     |
		|          +----------------+--------------------------------+
		|          | $SEARCH_RESULTS| Listing of each keyword and the|
		|          |                | number of 'hits' for each      |
		|          +----------------+--------------------------------+
		|          | $FIND_SIZE     | This is the size (in MB/GB/TB) |
		|          |                | of the files found             |
		|          +----------------+--------------------------------+
		|          | $TOTAL_SIZE    | This is the total size of all  |
		|          |                | files on the network           |
		|          +----------------+--------------------------------+
		|          | $LOW_RESULTS   | This is the lowest result on   |
		|          |                | the page                       |
		|          +----------------+--------------------------------+
		|          | $HIGH_RESULTS  | This is the highest result on  |
		|          |                | the page                       |
		|          +----------------+--------------------------------+
		|          | $TOTAL_RESULTS | Total number of results        |
		|          +----------------+--------------------------------+
		|          | $SEARCH_TIME   | Time in seconds it took to find|
		|          +----------------+--------------------------------+
		|          | $PAGING_BAR    | The list of result pages       |
		+----------+----------------+--------------------------------+
		| Footer   | $PAGING_BAR    | The list of result pages       |
		|          +----------------+--------------------------------+
		|          | $UNTZ_FOOTER   | A small ad for UntzUntz LS     |
		+----------+----------------+--------------------------------+
		| Search   | $NUMBER        | Result number                  |
		| Template +----------------+--------------------------------+
		|          | $FILE_PATH     | Computer\Share\Dir\file        |
		|          +----------------+--------------------------------+
		|          | $FILE_NAME     | Name of file found             |
		|          +----------------+--------------------------------+
		|          | $FILE_SIZE     | Size of file (in KB/MB/GB)     |
		|          +----------------+--------------------------------+
		|          | $FILE_DATE     | Date of file                   |
		|          +----------------+--------------------------------+
		|          | $FILE_TIME     | Time of file                   |
		|          +----------------+--------------------------------+
		|          | $FILE_LOCATION | Computer\Share\Dir             |
		+----------+----------------+--------------------------------+

		Notes:
			1. $QUERY_DATA already has the "" around it

		Any HTML will work around these variables.  Only one of each
		variable will be expanded per line.  Meaning you can have:

			$NUMBER. $FILE_NAME

		But not:

			$NUMBER, $NUMBER, $NUMBER

		And why would you want to?  But if you really did want to, you
		could put each one on its own line:

			$NUMBER,
			$NUMBER,
			$NUMBER

		which would be the same.

		Look at the example template files for a better idea of how
		they work.
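
		For instance, a minimal search template file might look like
		this (a sketch, not one of the shipped examples - any HTML
		around the variables will do):

			<P>$NUMBER. <B>$FILE_NAME</B> ($FILE_SIZE)<BR>
			$FILE_LOCATION - $FILE_DATE $FILE_TIME</P>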

Problems or Questions?
---------------------
	One thing you can start with is changing the DEBUG_LEVEL variable
	in the source.  Change the 0 to a number between 1 and 9; 1 will
	give the least amount of information, 9 the most.

	Otherwise, email me at jed204@users.sourceforge.net


Current Features and Information:
--------------------------------
	Currently the program is configured through a configuration file.
	Within the configuration file is information about the master
	browser (or a computer on the network), a temporary directory, a
	logo file for the search page, and a network user name to search
	as (probably 'guest' for most networks).

	The crawler currently scans the network for clients, creates a
	share listing, and from the share listing creates a file listing.
	There is support for 'robots.txt': if this file is in the root
	directory of a share, the share will be skipped; if it is in a
	lower-level directory of a share, that directory will be skipped.
	Currently the crawler will skip hidden and printer shares as well.

	Once the crawler has found all the files on the network, the CGI
	program can search against them.  The CGI takes less than 1 second
	to search through the 30,000 files in 2,500 directories on my
	network.  I created a database with 350,000 files and 30,000
	directories and it took about 5 to 6 seconds.  These benchmarks
	are from a Pentium III 450 MHz with 512 MB of RAM.

	Also see the beginning of each source file for more information
	about development and releases.

	If you have any problems feel free to contact me at
	jed204@users.sourceforge.net

Thanks,
John
jed204@users.sourceforge.net