WWWPS

World Wide Web positioning system.

Uses ngrams to weigh links against a search query and crawls towards higher-rated hits.

Usage

At the moment the crawler exposes a JSON API with three endpoints. Given a query "hello world" you can:

Start crawling a query: curl -X PATCH "http://localhost:1337/start?query=hello+world"
Stop crawling a query: curl -X PATCH "http://localhost:1337/halt?query=hello+world"
Get results for a query: curl "http://localhost:1337/search?query=hello+world"
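
The same three calls can be scripted from Python. This is a minimal sketch using only the standard library. The helper names `build_request` and `call` are made up for illustration, and the shape of the JSON the server returns is not documented here, so `call` just decodes whatever comes back.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:1337"  # address taken from the curl examples


def build_request(action, query):
    """Build (but do not send) a request for one of the three endpoints."""
    # start and halt use PATCH, search is a plain GET, as in the curl examples.
    methods = {"start": "PATCH", "halt": "PATCH", "search": "GET"}
    if action not in methods:
        raise ValueError(f"unknown action: {action!r}")
    # urlencode turns "hello world" into "query=hello+world".
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{BASE_URL}/{action}?{params}", method=methods[action]
    )


def call(action, query):
    """Send the request and decode the body as JSON.

    Assumes the crawler is running locally and answers with a JSON body.
    """
    with urllib.request.urlopen(build_request(action, query)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the crawler running, `call("start", "hello world")` followed some time later by `call("search", "hello world")` mirrors the curl commands.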

Rationale

Copied from the WWW page on my digital garden. Text there might differ.

So the starting assumption here is that the internet is perfectly fine and that useful/interesting websites are still there - it's just that the discoverability is not that good. Well, why not try and solve discoverability with a positioning system? If you think of search engines as maps that you use to navigate the web, think of WWWPS as a GPS which knows in which direction something is.

The user interface for a world wide web positioning system? A plaintext messageboard.

This is an example of immortal software - any moderately skilled programmer on the planet can easily build the following ~~two~~ three things in a relatively short amount of time:

a rudimentary messageboard with enumerated posts, 1 level replies, quoting, and pagination
a respectful crawler (waits 10-30 seconds between HTTP requests) bundled with a list of roughly the top 1000 most common words and the top 5-10 TLDs as starting points, an ngram (trigram) similarity comparison, and a gradually decreasing threshold for the next most similar page to look at
a broker with an (HTTP?) API for POST to /messages and GET /message and a FIFO CSV storage to act as an intermediary between the messageboard and however many crawler processes are needed
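
To give a feel for how small the crawler's scoring core can be, here is a rough sketch of a trigram similarity comparison with a decreasing threshold. It is not the code from the repository: the Jaccard overlap, the `pick_next` name, and the default `threshold` and `step` values are all assumptions made for illustration, and "decreasing threshold" is interpreted here as lowering the bar step by step until some candidate clears it.

```python
def trigrams(text):
    """Return the set of character trigrams of a lowercased string."""
    normalised = " ".join(text.lower().split())  # collapse whitespace
    return {normalised[i:i + 3] for i in range(len(normalised) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pick_next(query, candidates, threshold=0.5, step=0.1):
    """Pick the best candidate that scores at or above the threshold.

    `candidates` maps URL -> link or page text. If nothing clears the bar,
    the threshold is lowered by `step` and the candidates are checked again.
    Returns (url, threshold_used), or (None, 0.0) if nothing matches at all.
    """
    scored = sorted(
        ((similarity(query, text), url) for url, text in candidates.items()),
        reverse=True,
    )
    while threshold > 0:
        for score, url in scored:
            if score >= threshold:
                return url, threshold
        # round() keeps floating point drift from piling up across steps
        threshold = round(threshold - step, 10)
    return None, 0.0
```

A real crawler would also sleep between requests and remember visited pages. This only shows the part that decides where to go next.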

So you can make a post on the messageboard like

blah blah blah yadda yadda and Python makes sense as one of the world's most popular languages at the time of writing.

lookup: search engine development python

which the crawler picks up, and as it finds increasingly better matches it replies to your post with the relevant pages. You could even micromanage it live by replying to yourself

ok let's narrow it down a bit

lookup: search engine development python
update: search engine development python, not tutorial

(the negative part of a query, such as "not tutorial" in the previous example, has its scoring inverted; think of it as multiplication by -1) or tell it to stop by doing

ok, thanks

halt: search engine development python

and it will stop. The syntax is just an example and obviously you need to set some kind of secret/password in order to instruct the crawler to change course or do something, but yeah that's essentially all there is to it.
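
As a sketch of how the example syntax could be read out of a post, the snippet below pulls `lookup:`, `update:` and `halt:` lines out of plain text and splits each query into positive and negated parts. Everything here is hypothetical: the function names, the rule that a comma-separated part starting with "not " is negated, and the pluggable `score` callable are assumptions, and the secret/password check is left out entirely.

```python
import re

# One command per line: "<command>: <query>", as in the example posts.
COMMAND_RE = re.compile(r"^(lookup|update|halt):[ \t]*(.+)$", re.MULTILINE)


def parse_commands(post):
    """Return a list of (command, positive_parts, negative_parts) tuples."""
    commands = []
    for command, rest in COMMAND_RE.findall(post):
        positive, negative = [], []
        for part in rest.split(","):
            part = part.strip()
            if part.startswith("not "):
                negative.append(part[4:].strip())  # drop the "not " prefix
            elif part:
                positive.append(part)
        commands.append((command, positive, negative))
    return commands


def signed_score(text, positive, negative, score):
    """Add the scores of positive parts, subtract those of negated parts.

    `score(query_part, text)` is any similarity function; subtracting is
    the "multiplication with -1" applied to the negated parts.
    """
    gained = sum(score(part, text) for part in positive)
    lost = sum(score(part, text) for part in negative)
    return gained - lost
```

Lines that do not start with one of the three keywords are ignored, so the chatty part of a post can stay free-form.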

One downside is that it's a bit slower at first and you might get a reply from a human being with useful results in the meantime but that's why it's a message board. You can discuss and collaborate on querying methods among other things and discuss the results with other people.

You would obviously be able to make posts hidden/private/password-protected.

Note that the speed does improve with each new query, because new queries are compared against older ones and better-scored matches are reused as starting points (this is already implemented). Horizontal scaling could speed up crawling further (not implemented yet), because the on-disk cache (implemented) can be shared (sharing is not implemented).
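
The reuse of older queries could look roughly like the sketch below. It is not taken from the repository: the `history` layout (old query mapped to a list of `(score, url)` results), the `seed_urls` name, the `limit` and `min_query_similarity` defaults, and the example.org URLs in the usage note are assumptions made for illustration.

```python
def trigrams(text):
    """Return the set of character trigrams of a lowercased string."""
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + 3] for i in range(len(normalised) - 2)}


def similarity(a, b):
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def seed_urls(new_query, history, limit=5, min_query_similarity=0.3):
    """Choose starting points for a new query from older queries' results.

    Each old result is weighted by how similar its query is to the new one,
    so good hits for a closely related query come first.
    """
    weighted = []
    for old_query, results in history.items():
        weight = similarity(new_query, old_query)
        if weight < min_query_similarity:
            continue  # unrelated query, its hits are unlikely to help
        for score, url in results:
            weighted.append((weight * score, url))
    weighted.sort(reverse=True)

    seeds = []
    for _, url in weighted:
        if url not in seeds:  # keep only the best-ranked copy of each URL
            seeds.append(url)
    return seeds[:limit]
```

For example, with a history holding results for "search engine development python" and for "cooking recipes", a new query "search engine python" would start from the former's hits (say https://example.org/a) and ignore the latter's.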

My WWWPS crawler implementation lives at https://github.com/lukal-x/wwwps; I got the idea by experimenting in that repository.

I'll give it a messageboard interface ASAP and put it up on http://searchboard.ftp.sh (https://searchboard.ftp.sh).
