The lists below contain stop words that we use to clean up text files on webpageanalyse.com. The lists are not comprehensive and can be freely adopted.
Our system treats all words that have no influence on the meaning or content of a text/document as stop words. We keep separate stop word lists in a total of 9 languages, which are used by websites for recognising text content.
Before our stop word lists can be applied, the language of a text must be reliably identified. We identify it with a language recognition tool that we developed ourselves, which is also available as an API. Once the language is known, we filter that language's stop words out of the text. We then carry out further analyses for our portals similarsitecheck.com and popuri.us, based on the normalised text.
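The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the `STOP_WORDS` dictionary is a hypothetical, heavily shortened stand-in for the real lists, the function name `remove_stop_words` is made up for this example, and the language code is assumed to have been determined beforehand by a separate language recognition step.

```python
import re

# Hypothetical, heavily shortened stop word lists keyed by language code.
# The real lists are much longer; these entries are for illustration only.
STOP_WORDS = {
    "en": {"the", "a", "an", "and", "or", "it", "we", "on", "in", "by", "not"},
    "de": {"der", "die", "das", "und", "oder", "nicht", "ein", "eine"},
}


def remove_stop_words(text, language):
    """Return the lower-cased words of `text` that are not stop words.

    `language` is assumed to be already known, e.g. from a prior
    language recognition step that is not shown here.
    """
    stop_words = STOP_WORDS[language]
    # Split into word tokens; punctuation is dropped by the pattern.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words("The cat sat on the mat and it purred", "en"))
    print(remove_stop_words("Der Hund und die Katze", "de"))
```

Keeping each list as a set makes the membership test cheap, which matters when the same list is applied to many documents.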
Below we have listed the word groups that we treat as stop words:
- pronouns (e.g. 'it', 'we', 'her', 'she')
- definite and indefinite articles (e.g. 'the', 'a', 'an')
- conjunctions (e.g. 'and', 'or', 'yet')
- prepositions (e.g. 'on', 'in', 'by')
- negations (e.g. 'non', 'not')
- stop letters (e.g. 'a', 'I')
- selected adjectives and adverbs
Symbols such as full stops (.), commas (,) or semicolons (;) are also frequently used as stop words. In general we do not process these, so we do not maintain a symbol list of our own. This is why our stop word lists contain words whose symbols have been removed, particularly in foreign languages.
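One plausible way to arrive at words with removed symbols is to strip every punctuation and symbol character from a word before it is compared with a stop word list. The following Python sketch shows that idea only; the function name `strip_symbols` is hypothetical, and the choice to drop all Unicode punctuation and symbol characters is an assumption made for this example, not a description of our actual normalisation.

```python
import unicodedata


def strip_symbols(word):
    """Remove punctuation and symbol characters from a single word.

    Assumption for this sketch: every character whose Unicode general
    category starts with "P" (punctuation) or "S" (symbol) is dropped.
    """
    return "".join(
        ch for ch in word
        if unicodedata.category(ch)[0] not in ("P", "S")
    )


if __name__ == "__main__":
    for raw in ["don't", "c’est", "end."]:
        print(raw, "->", strip_symbols(raw))
```

With this kind of normalisation, a contraction and its entry in the stop word list only match if the list stores the stripped form as well.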
List of stop words by language: