V5 development progress - Incremental indexing
Requirements
------------
First of all, incremental indexing is only available for the PHP, ASP, and CGI versions. It is not available for the JavaScript version, because that version cannot index a set of files large enough for incremental indexing to be beneficial.
Second, in order to use incremental indexing, you must NOT have modified your indexing configuration since the last index was made. The ZCFG file must contain the exact same settings, and the index files must still be in the output folder specified.
"Update existing index"
------------------------
This option will look through the list of pages found in your existing index and check if they have since been modified. It will then perform a partial index of only the pages that have changed (and potentially index any new pages that you have added links to).
Note that there are some limitations to this, and that with each subsequent update, the index gets larger and less efficient. We recommend performing a full re-index regularly where possible (perhaps once a week, or once a month, depending on how often you perform a partial index).
Note also that the ability for Zoom to determine whether a file was modified is dependent entirely on the last-modified date retrieved and the filesize. If these attributes are inaccurate or do not represent the changes to the file, then it will not be able to accurately find the files which have been changed.
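As a rough illustration of the check described above, the sketch below compares the date and size recorded at index time against freshly retrieved values. `PageStamp` and `has_changed` are hypothetical names, not part of Zoom. Treating a page with a missing date or size as changed follows the behaviour described later in this thread; applying that rule when only one of the two values is missing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageStamp:
    """Attributes recorded for a page (hypothetical structure, not Zoom's)."""
    last_modified: Optional[str]  # raw last-modified value, if one was returned
    size: Optional[int]           # file size in bytes, if one was returned


def has_changed(indexed: PageStamp, current: PageStamp) -> bool:
    """Return True if the page should be re-indexed."""
    # Assumption: if either value is missing there is nothing reliable
    # to compare, so the page is treated as changed.
    if current.last_modified is None or current.size is None:
        return True
    return (indexed.last_modified, indexed.size) != (
        current.last_modified,
        current.size,
    )
```

A page that was edited without its date or size changing would be reported as unchanged by a check like this, which is the limitation noted above.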
"Add start points to existing index"
-------------------------------------------------
This option allows you to add and index a list of start points (usually a new website, or a part of a new website) to an existing index. This can be useful if you manage a list of websites as start points and you wish to add new start points to the index on a regular basis.
It will index the new start point, append this data to the existing index (without having to re-index the existing start points) and save the configuration with your added start points (so that on your next full re-index, the new start points will be included).
"Add list of new or updated pages"
-------------------------------------------------
This feature allows you to specify a list of new pages which are to be indexed and added to the existing index. If you specify a page here which already exists in the index, Zoom will assume that this page has been updated/modified, remove the old data for this page, and add the new data.
"View or delete pages from existing index"
-------------------------------------------------
This allows you to browse the list of pages which exist in your current index. It also allows you to mark certain pages for deletion - removing them from the searchable content. Note that deleting pages using this function will NOT decrease the size of your index files.
To summarize regarding the effect of these features on your existing index:
- Adding new pages does not compromise the efficiency of an existing index.
- Updating and removing pages causes an existing index to become progressively less efficient (as more pages are removed/updated).
We don't have any good benchmark figures to post, but if you have updated more than 20% of the pages in your index, you should do a full re-index to 'defrag' the index.
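To make the summary above concrete, here is a toy model of an append-only index. It is only a sketch that matches the behaviour described in this post (updates and deletions leave dead data behind, and deleting does not shrink the index). It is not Zoom's actual file format, and `ToyIndex` and `should_full_reindex` are hypothetical names.

```python
class ToyIndex:
    """Append-only toy index; NOT Zoom's real file format."""

    def __init__(self):
        self.records = []  # every record ever written: [url, data, is_live]
        self.live = {}     # url -> position of that page's live record

    def add_or_update(self, url, data):
        # A page already in the index is treated as modified:
        # its old record is retired before the new one is appended.
        self.delete(url)
        self.live[url] = len(self.records)
        self.records.append([url, data, True])

    def delete(self, url):
        # Mark the record as dead; it still occupies space in the index.
        position = self.live.pop(url, None)
        if position is not None:
            self.records[position][2] = False

    def wasted_records(self):
        # Records that take up space but are no longer searchable.
        return len(self.records) - len(self.live)


def should_full_reindex(changed_pages, total_pages, threshold=0.20):
    """Apply the 'more than 20% updated' guideline from the text."""
    if total_pages <= 0:
        return False
    return changed_pages / total_pages > threshold
```

In this model, adding ten new pages leaves no wasted records, while updating three of them grows the index to thirteen records with three dead ones. A full re-index corresponds to rebuilding the structure from scratch so that only live records remain.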
Hello, does this mean that if I update the page storm.html on my server, I can add this page using the method above? Am I correct in thinking Zoom will remove storm.html and all its info and then add it again to the index?
Does this mean no performance loss?
Thank you
You can just add the additional page.
Update existing index - If the site I'm indexing is dynamic (with no date or meta info), what does Zoom do? Does it skip the file and not update it (matching it by URL), OR does it delete the file from the index and add it anyway, OR does it do something else?
If a page does not return filesize or date information, Zoom will presume that it is dynamic and has changed and it will update the file.
If a site does not contain any date or filesize information, we would not recommend using the Incremental Update feature (since all pages will need to be removed and re-indexed, so you would be better off doing a full proper re-index).
Does a command-line command autostart zoom? (I assume it does)
Yes. All command-line features autostart Zoom, but you will need to specify the -s, -o, or -r command to autostart Zoom in spider, offline, or report mode respectively. See the existing Users Guide regarding these autostart commands.
If yes, does it auto-close zoom upon completion?
Yes.
If zoom is already open what happens?
A new instance of Zoom will start and perform the operation you specified.
If zoom is in the middle of an operation and a command is executed, what will happen?
See above.
I intend to host zoom remotely on a shared / dedicated server, what are hosting requirements / any tips?
Your server must run Windows and meet the System Requirements for running the Zoom Indexer application.
Try to schedule/run the indexing during low load periods on your server.
Most shared hosting solutions will not allow you to run a native Windows application on the server. You may need to have a dedicated (or your own hosted) server to do this.
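If indexing is launched from a script rather than a fixed scheduled task, a small guard like the one below can keep it inside a quiet period. This is only a sketch: the function name and the default window of 01:00 to 05:00 are assumptions, and the right hours depend on your own server's load.

```python
from datetime import time


def in_low_load_window(now, start=time(1, 0), end=time(5, 0)):
    """Return True if `now` (a datetime.time) falls inside the window.

    The window is start-inclusive and end-exclusive.
    """
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 23:00 to 04:00.
    return now >= start or now < end
```

A calling script could pass `datetime.now().time()` and skip or postpone the indexing run when the function returns False.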
However, some things are still unclear.
**These questions are regarding the CLI**
Add start points - If I have a start point already in the cfg file, say google.com and I want to add a single page to the index from another domain, do I have to add a start point first or can I just add the additional page?
Thank you for your time
------------------------------------------------
We've added a list of command-line parameters to Zoom that will allow you to call upon the above incremental indexing features via the command-line. This will allow developers to call Zoom to perform these operations via external scripts or applications (eg. you could have a server-side script which calls upon Zoom to add a new start point to an existing index when a user submits them via a webpage).
The new commands are:
-update
This will perform an incremental update (as described above) on the specified ZCFG file. You must also specify the index mode (offline or spider) and the config file like so:
ZoomIndexer.exe -s zoom.zcfg -update
-addpage
This will add a specific page to the existing index specified by the config file and index mode. eg.
ZoomIndexer.exe -s zoom.zcfg -addpage http://www.mywebsite.com/newpage.html
Note that if you are using offline mode, you will need to specify a base URL following the addpage path, separated by a pipe ("|") character, eg.
ZoomIndexer.exe -o zoom.zcfg -addpage C:\mywebsite\newpage.html|http://www.mywebsite.com/
-addpages
This is the same as -addpage but allows you to specify a text file containing a list of new pages (rather than calling it for one page only). eg.
ZoomIndexer.exe -s zoom.zcfg -addpages newpages.txt
Similarly, offline mode will expect a base URL following the text filename (separated by a pipe character).
-addstartpt
This option will perform an incremental add start point operation on the specified config file and index mode. eg.
ZoomIndexer.exe -s zoom.zcfg -addstartpt http://www.mynewsite.com/
Offline mode will expect a base URL following the start directory (separated by a pipe character). eg.
ZoomIndexer.exe -o zoom.zcfg -addstartpt C:\mynewwebsite|http://www.mynewwebsite.com/
-addstartpts
This is the same as -addstartpt but allows you to specify a text file containing a list of start points.
In spider mode, the format of this text file is the same as the "Import start points" feature, which allows you to specify spidering options such as "index and follow" or "index only", and to specify a limit on the number of pages to index for each start point. See the chapter on "Importing and exporting additional URLs" in the Users Guide for more information.
-deletepage
This parameter will delete the specified page from the index as configured by the ZCFG file given and the index mode specified. eg.
ZoomIndexer.exe -s zoom.zcfg -deletepage http://www.mywebsite.com/oldnews.html
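For anyone calling these commands from a server-side script, the sketch below assembles the argument list for each operation listed above. It uses only the switches named in this post. The helper names (`build_zoom_command`, `run_zoom`) are hypothetical, the executable is assumed to be reachable as `ZoomIndexer.exe`, and requiring a base URL in offline mode for `-addpage`, `-addpages` and `-addstartpt` is how the descriptions above have been interpreted here.

```python
import subprocess

MODE_FLAGS = {"spider": "-s", "offline": "-o"}

# Operations from the list above that take a page, start point, or text file.
TARGET_OPS = {"-addpage", "-addpages", "-addstartpt", "-addstartpts", "-deletepage"}

# Operations described above as expecting "target|base URL" in offline mode.
OFFLINE_BASE_URL_OPS = {"-addpage", "-addpages", "-addstartpt"}


def build_zoom_command(mode, config, operation, target=None, base_url=None,
                       exe="ZoomIndexer.exe"):
    """Return the argument list for one incremental indexing call."""
    if mode not in MODE_FLAGS:
        raise ValueError("mode must be 'spider' or 'offline'")
    cmd = [exe, MODE_FLAGS[mode], config, operation]
    if operation == "-update":
        if target is not None:
            raise ValueError("-update takes no target")
        return cmd
    if operation not in TARGET_OPS:
        raise ValueError("unknown operation: " + operation)
    if target is None:
        raise ValueError(operation + " needs a target")
    if mode == "offline" and operation in OFFLINE_BASE_URL_OPS and base_url is None:
        raise ValueError(operation + " in offline mode needs a base URL")
    if base_url is not None:
        # The base URL follows the target, separated by a pipe character.
        target = target + "|" + base_url
    cmd.append(target)
    return cmd


def run_zoom(cmd):
    """Launch Zoom and return its exit code.

    The meaning of the exit code is not covered in this post, so it is
    returned as-is rather than interpreted.
    """
    # Passing a list (no shell) means '|' reaches Zoom as a literal character
    # instead of being treated as a pipe by the command interpreter.
    return subprocess.run(cmd).returncode
```

For example, `build_zoom_command("offline", "zoom.zcfg", "-addpage", r"C:\mywebsite\newpage.html", "http://www.mywebsite.com/")` yields a list whose last element is the path and base URL joined by a pipe, ready to pass to `run_zoom` on the Windows machine where Zoom is installed.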
In the meantime, you could consider increasing the size of your web cache (in Windows/Internet Explorer which shares a common cache) and allowing Zoom to use the cache to minimize web traffic.
When using incremental indexing with command-line, can I use a remote url to the cfg file?
Thanks
One more quick question. Is it possible to 'defrag' on the local machine instead of having to re-index (theoretically all of the data is in the files, it just needs repacking). I'm asking this because a) It will be quicker b) saves bandwidth and some external sites which I index can't take the extra load.
AG!

