[Samba] Spotlight indexing with fscrawler for multiple shares

Kees van Vloten keesvanvloten at gmail.com
Fri Aug 11 17:35:53 UTC 2023


On 10-08-2023 15:38, Matthias Kühne | Ellerhold Aktiengesellschaft via 
samba wrote:
> Hey Kees,
>
> fs2es-indexer is designed to be a lightweight alternative to FSCrawler.
> So no ... it doesn't do any content indexing or save much of the metadata.
>
> As far as I understand it, the OCR and other stuff are what make FScrawler
> that big. And we don't need any of that - we just want to search for file
> names.
>
> BUT I'm open to merge requests ;-)
>
> I'm currently getting away with a lot less complexity because I don't need
> to watch for changes in files, since that's not something I'm indexing.
> If I added more metadata (even just the size!) I would have to verify that
> it stays correct and start listening for "file X has changed" events
> somehow...
>
> fanotify seems like a sweet framework for that, but sadly ZFS is
> incompatible with it...

I had a closer look at how FScrawler handles this, and it turns out to 
be rather simple.

It stores a timestamp of the last run in _status.json and in the next 
run it looks at files modified after that timestamp only.

If you want it to reindex all files you can simply remove the 
_status.json file and wait for the next run. Nothing high-tech or 
complex here :-)
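
To illustrate the idea, here is a minimal Python sketch of that kind of timestamp-based incremental scan. This is not FScrawler's actual code (FScrawler is written in Java). The function names and the "last_run" key are my own assumptions for illustration, and the real layout of _status.json may well differ.

```python
import json
import os
import time


def load_last_run(status_file):
    """Return the timestamp of the previous run, or 0.0 if there is none."""
    try:
        with open(status_file, encoding="utf-8") as fh:
            # "last_run" is a hypothetical key name chosen for this sketch.
            return float(json.load(fh)["last_run"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        # Missing or unreadable status file: treat every file as new.
        return 0.0


def changed_since(root, last_run):
    """Yield paths of files under root whose mtime is later than last_run."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > last_run:
                    yield path
            except OSError:
                # File vanished or is unreadable; skip it in this run.
                continue


def run_once(roots, status_file):
    """Scan all roots, return the changed files, and record this run."""
    # Take the timestamp before scanning so that files modified while the
    # scan is running are picked up again by the next run.
    started = time.time()
    last_run = load_last_run(status_file)
    changed = [p for root in roots for p in changed_since(root, last_run)]
    with open(status_file, "w", encoding="utf-8") as fh:
        json.dump({"last_run": started}, fh)
    return changed
```

Calling run_once() with a list of share paths returns only the files to (re)index, and deleting the status file makes the next call return everything. Note that a scan like this only sees new and modified files; it cannot tell by itself that a file has been deleted.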

> Samba does not let me get this data efficiently either, so I'm forced to
> do regular scans of the whole fs.... which might take a while depending on
> the number of files.
>
> Adding support for OpenSearch though shouldn't be that hard, right? I've
> already got a version switch for ES v7 and v8, adding OS to it should be
> easy enough!
>
> Have a nice day,
> Matthias.
>
> Am 10.08.23 um 15:01 schrieb Kees van Vloten via samba:
>> Hi Matthias,
>>
>> Op 10-08-2023 om 14:46 schreef Matthias Kühne | Ellerhold
>> Aktiengesellschaft via samba:
>>> Hey Kees,
>>>
>>> disclaimer: shameless self-plug!!
>>>
>>> If you don't need content indexing you can use my indexer:
>>> https://github.com/Ellerhold/fs2es-indexer
>> I have looked at it because of trouble with FScrawler and I love your
>> solution because it does not need heavyweight Java.
>>
>> But there is one thing FScrawler is good at: it indexes all kinds of
>> metadata of files (like exif data in photos etc), it can even do OCR.
>> This is what the fs2es-indexer does not seem to do (to my understanding).
>>
>> That is the reason why I am stuck with FScrawler for now.
>>
>> Hopefully I am wrong and you are going to tell me that fs2es-indexer
>> has all the functionality of FScrawler but not the issues :-)
>>
>> The other thing is that I am pushing data to OpenSearch, which requires
>> me to patch and compile FScrawler, another complexity I don't like
>> very much.
>>
>> - Kees
>>
>>> I've created it because I couldn't get FScrawler to work correctly.
>>>
>>> You can add as many directories as you like in the config, it'll crawl
>>> them through one daemon service.
>>>
>>> I'm planning on adding smb.conf parsing, so you don't even have to add
>>> these directories to the yaml file and can just use samba as you would.
>>>
>>> Let me know if you need some help setting it up or otherwise.
>>>
>>> Have a nice day,
>>>
>>> Matthias.
>>>
>>> Am 04.08.23 um 19:56 schrieb Kees van Vloten via samba:
>>>> Hi Team,
>>>>
>>>>
>>>> Did anybody solve the issue of FScrawler crawling over multiple
>>>> shares, preferably from a single job or from a single service?
>>>>
>>>> Setting up a service for FScrawler per share does not scale very
>>>> nicely...
>>>>
>>>>
>>>> - Kees.
>>>>
>>>>


