So I’ve been following the Ladybird browser project and I love the open-source approach they’re taking. Got me thinking - why not a search engine?
I know Google’s got a stranglehold on the market, but I’m curious - would you use an open-source search engine if it prioritized privacy, transparency, community involvement, and user control? What features would you want to see?
I like some of the features that Kagi is implementing, but it's not open source.
The search engine software is not the problem - the data (the index) and the computing resources needed for constant crawling, indexing, storing data, serving queries, etc. are the issue.
You need deep pockets for that. As long as Google, Bing, etc. remain free to use, why would you spend all that money to build your own index?
Unfortunately, open source would mean people would know exactly how to game it. The results would be filled with spam.
The only workaround would be intense curation of the results, but then spammers would become the curators.
This is what happened to the DMOZ directory that Google used in its first few years (among other problems).
Plus, as previously mentioned, who would cover the cost of the infrastructure?
Searx is a pseudo-solution: it preserves privacy, but it still relies on other search engines.
Edit: maybe niche, topic-specific search engines could be a way forward. We'd still rely on curation, but it would be more manageable for a few individuals, and we could compare them by the quality of their results.
You need to index the web first with bots and crawlers, then provide that index to the actual search engine. All of this requires computing power of the kind you need cooled factory halls for.
AFAIK there are only two companies that do this nowadays: Bing and Google. Maybe Yandex, too. Correct me if I'm wrong.
Everything else is various frontends, and on that front many privacy-friendly and/or FOSS solutions already exist.
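To make the scale problem concrete, here's a toy sketch of what "crawl and index" actually means at the smallest possible scale. Everything in it (the boolean-AND search, the tag stripping, the lack of politeness rules) is made up for illustration; the real versions of these loops, run over billions of pages, are exactly what needs those cooled factory halls.

```python
# Toy crawler + inverted index, just to show what the "index" part means.
# A real crawler adds robots.txt handling, rate limits, dedup, link
# scheduling, ranking, and distributed storage -- that's where the
# data-center budget goes.
import re
import urllib.request
from collections import defaultdict

def fetch(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def tokenize(html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html)           # strip tags (crudely)
    return re.findall(r"[a-z0-9]+", text.lower())  # lowercase word tokens

index: dict[str, set[str]] = defaultdict(set)      # term -> set of URLs

def crawl(urls: list[str]) -> None:
    for url in urls:
        try:
            for term in tokenize(fetch(url)):
                index[term].add(url)
        except Exception:
            pass                                   # skip unreachable pages

def search(query: str) -> set[str]:
    terms = tokenize(query)
    if not terms:
        return set()
    # Pages containing every query term (boolean AND, no ranking).
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```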
There is Marginalia, the coolest search engine right now. It focuses on non-commercial, small, independent sites. It's not a Google or Bing replacement, but it brings joy and serendipity to my searches. That rush of the Internet of yore.
Many people have said that the issue is running it, but projects like BitTorrent or the SheepIt! render farm give me the idea of letting people host worker nodes themselves.
I think a distributed index would be a good project, and people would be interested.
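A very rough sketch of what volunteer-hosted worker nodes might look like, assuming the index is sharded by term. The node names and the hash routing here are entirely hypothetical, and the hard parts (replication, node churn, spam and abuse resistance) are left out.

```python
# Minimal sketch of a term-sharded index across volunteer nodes.
# Node addresses are placeholders; a real system would use consistent
# hashing so nodes can join and leave without reshuffling everything.
import hashlib

NODES = ["node-a.example", "node-b.example", "node-c.example"]  # volunteer hosts

def node_for_term(term: str) -> str:
    """Route each term to the node that holds its posting list."""
    h = int(hashlib.sha256(term.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def plan_query(query: str) -> dict[str, list[str]]:
    """Group the query's terms by the node responsible for them."""
    plan: dict[str, list[str]] = {}
    for term in query.lower().split():
        plan.setdefault(node_for_term(term), []).append(term)
    return plan

# The client asks each node for its terms' posting lists, then intersects them.
print(plan_query("open source search"))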
I just had a simple but fun idea: a personal search engine with a database of all the pages I've ever visited. It could scan those pages for updates and stay current. Getting new pages into it requires using a different search engine, but once you have such a DB, it could suggest widening certain pages to include the entire site. Text-only, of course, and probably compressed.
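Something like this is surprisingly doable with off-the-shelf parts. A minimal sketch, assuming a SQLite build with FTS5 available; the schema, the helper names, and the compressed archive table are just illustrative, and hooking it up to real browser history is the missing piece.

```python
# Sketch of a personal "pages I've visited" index using SQLite FTS5.
# Feeding it real browser history / page text is out of scope here.
import sqlite3
import zlib

db = sqlite3.connect("visited.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")
db.execute("CREATE TABLE IF NOT EXISTS raw(url TEXT PRIMARY KEY, gz BLOB)")

def remember(url: str, text: str) -> None:
    """Store plain text for searching plus a compressed copy for archiving."""
    # Note: no dedup here; re-visiting a page would add a second FTS row.
    db.execute("INSERT INTO pages(url, body) VALUES (?, ?)", (url, text))
    db.execute("INSERT OR REPLACE INTO raw(url, gz) VALUES (?, ?)",
               (url, zlib.compress(text.encode())))
    db.commit()

def search(query: str, limit: int = 10) -> list[str]:
    """Return the best-matching URLs among pages I've already visited."""
    rows = db.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit))
    return [url for (url,) in rows]

remember("https://example.org", "An example page about open source search.")
print(search("open source"))
```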
I was thinking about Grover's algorithm a while back: unstructured search that finds a match with high probability. Wow, my neurotransmitters likey-likey very much.
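For reference, and assuming you could even load a web-scale index into a quantum computer, Grover's speedup for unstructured search is quadratic, not exponential:

```latex
% Unstructured search over N items, one marked:
k_{\text{classical}} = \Theta(N)
\quad\text{vs.}\quad
k_{\text{Grover}} \approx \frac{\pi}{4}\sqrt{N}
\quad\text{oracle queries, succeeding with high probability.}
```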
I've used DuckDuckGo for a while; it's usable, but it seems worse than Google. Its results are mostly less relevant because they index fewer pages, which is clearly visible when searching for images and articles. Google also has some additional features, like direct answers to some queries (I used it to find the area and population of cities and countries).
I would guess that one solution would be a browser plug-in, so that sites actually used and visited get added. The problem, though, is that you then have to find a way of adding new sites, and you have to think about privacy. Paywalls would not be a problem for the index, but copyright might be, and privacy certainly would be.
Storing and searching the data could be collaborative, with a p2p system like Bitcoin. Most browsers have a cache, so a lot of data is already stored across billions of devices. You would have to figure out how to distribute the queries and then filter, collate, and sort the answers. Nobody wants a million answers, so a query would travel more widely if there were no answers and would not need to travel far for a popular search.
There are algorithms for sending a search out to many devices and returning results efficiently, but a mistake there could crash the network (a search with no answer ends up asking everyone), so sensible limits are necessary. If a thousand peers don't have it, you are asking for something very obscure.
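Those "sensible limits" are basically a hop/TTL cap on query flooding, the old Gnutella approach. A toy simulation, with made-up peers and local indexes, just to show how the TTL keeps an unanswerable query from reaching every device:

```python
# Toy simulation of TTL-limited query flooding over a peer network.
# Peers, neighbors, and local indexes are invented for illustration;
# the point is only that the TTL bounds how far a query can spread.
from __future__ import annotations

class Peer:
    def __init__(self, name: str, local_hits: set[str]):
        self.name = name
        self.local_hits = local_hits          # queries this peer can answer
        self.neighbors: list[Peer] = []

    def handle(self, query: str, ttl: int, seen: set[str]) -> list[str]:
        if self.name in seen:                 # don't process the same query twice
            return []
        seen.add(self.name)
        hits = [f"{self.name}: {query}"] if query in self.local_hits else []
        if ttl > 0 and not hits:              # popular queries stop early
            for peer in self.neighbors:
                hits += peer.handle(query, ttl - 1, seen)
        return hits

# Tiny network: a - b - c
a, b, c = Peer("a", set()), Peer("b", set()), Peer("c", {"obscure topic"})
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]

print(a.handle("obscure topic", ttl=2, seen=set()))    # reaches c, gets a hit
print(a.handle("nothing has this", ttl=2, seen=set())) # dies out at the TTL
```

Real p2p systems tend to replace flooding with a DHT lookup (Kademlia-style), so the query cost stays bounded regardless of how obscure the search is.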
A good search engine needs access to books, journals, news, and things to buy. There are so many obstacles to doing it well, but Google, for all its faults, gives an answer most of the time… unlike Amazon… but as Google becomes more like Amazon, an alternative becomes necessary.
AI is new, so it can work better than Google and Bing, but give it time and it will be spammed like everything else. It could be argued that it is already spammed, but we don't notice yet because it is a different kind of spam for now. They must be filtering it out somehow, because we don't get answers like "protons are parts of atoms, are currently free, and you need to pay us some money if you want to keep it that way."
You need a server for that; it's not something someone can just download and install on their PC. So the question is more about the server's management and policies and less about whether the software itself is open source or not.
Too expensive. Software you can write and distribute with almost no overhead; a search engine needs massive infrastructure to crawl, index, serve requests, etc.