.oO SearXNG Developer Documentation Oo.
searx.engines.startpage Namespace Reference

Functions

 init (_)
 get_sc_code (searxng_locale, params)
 request (query, params)
tuple[str, datetime|None] _parse_published_date (str content)
 _get_web_result (result)
 _get_news_result (result)
dict[str, t.Any]|None _get_image_result (result)
 response (resp)
 fetch_traits (EngineTraits engine_traits)

Variables

dict about
str startpage_categ = 'web'
bool send_accept_language_header = True
list categories = ['general', 'web']
bool paging = True
int max_page = 18
bool time_range_support = True
bool safesearch = True
dict time_range_dict = {'day': 'd', 'week': 'w', 'month': 'm', 'year': 'y'}
dict safesearch_dict = {0: '0', 1: '1', 2: '1'}
str base_url = 'https://www.startpage.com'
str search_url = base_url + '/sp/search'
str search_form_xpath = '//form[@id="search"]'
int sc_code_cache_sec = 3600

Detailed Description

Startpage's language & region selectors are a mess ..

.. _startpage regions:

Startpage regions
=================

In the list of regions there are tags we need to map to common region tags::

  pt-BR_BR --> pt_BR
  zh-CN_CN --> zh_Hans_CN
  zh-TW_TW --> zh_Hant_TW
  zh-TW_HK --> zh_Hant_HK
  en-GB_GB --> en_GB

and there is at least one tag with a three-letter language tag (ISO 639-2)::

  fil_PH --> fil_PH

The locale code ``no_NO`` from Startpage does not exist and is mapped to
``nb-NO``::

    babel.core.UnknownLocaleError: unknown locale 'no_NO'

For reference see languages-subtag at iana; ``no`` is the macrolanguage [1]_ and
W3C recommends subtag over macrolanguage [2]_.

.. [1] `iana: language-subtag-registry
   <https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry>`_ ::

      type: language
      Subtag: nb
      Description: Norwegian Bokmål
      Added: 2005-10-16
      Suppress-Script: Latn
      Macrolanguage: no

.. [2]
   Use macrolanguages with care.  Some language subtags have a Scope field set to
   macrolanguage, i.e. this primary language subtag encompasses a number of more
   specific primary language subtags in the registry.  ...  As we recommended for
   the collection subtags mentioned above, in most cases you should try to use
   the more specific subtags ... `W3: The primary language subtag
   <https://www.w3.org/International/questions/qa-choosing-language-tags#langsubtag>`_
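
The mappings above can be sketched as a small normalization helper; ``STARTPAGE_FIXES`` and ``normalize_region_tag`` are hypothetical names, and the real :py:obj:`fetch_traits` resolves script subtags (``Hans``/``Hant``) via babel instead of a hard-coded table:

```python
# Hypothetical sketch of the region-tag normalization described above; the
# real fetch_traits() uses babel.Locale.parse to resolve script subtags.
STARTPAGE_FIXES = {
    'pt-BR_BR': 'pt_BR',
    'zh-CN_CN': 'zh_Hans_CN',
    'zh-TW_TW': 'zh_Hant_TW',
    'zh-TW_HK': 'zh_Hant_HK',
    'en-GB_GB': 'en_GB',
    'no_NO': 'nb_NO',  # babel raises UnknownLocaleError for 'no_NO'
}

def normalize_region_tag(eng_tag):
    """Map a Startpage region tag to a common locale tag."""
    tag = STARTPAGE_FIXES.get(eng_tag, eng_tag)
    if '-' in tag:
        # e.g. 'fr-CA_CA' --> 'fr_CA': keep language and territory only
        lang, rest = tag.split('-', 1)
        territory = rest.split('_')[-1]
        tag = f"{lang}_{territory}"
    return tag
```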

.. _startpage languages:

Startpage languages
===================

:py:obj:`send_accept_language_header`:
  The names displayed on Startpage's settings page depend on the location of
  the IP when the ``Accept-Language`` HTTP header is unset.  In
  :py:obj:`fetch_traits` we use::

    'Accept-Language': "en-US,en;q=0.5",
    ..

  to get uniform names independent of the IP.
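
For illustration, the ``Accept-Language`` value built in :py:obj:`get_sc_code` follows this pattern (``accept_language`` is a hypothetical helper name, not part of the engine):

```python
def accept_language(language, territory=None):
    # Hypothetical helper mirroring the Accept-Language value built in
    # get_sc_code(): territory-qualified first, plain language as fallback.
    if territory:
        return "%s-%s,%s;q=0.9,*;q=0.5" % (language, territory, language)
    return language
```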

.. _startpage categories:

Startpage categories
====================

Startpage's category (for Web-search, News, Videos, ..) is set by
:py:obj:`startpage_categ` in settings.yml::

  - name: startpage
    engine: startpage
    startpage_categ: web
    ...

.. hint::

  Supported categories are ``web``, ``news`` and ``images``.

Function Documentation

◆ _get_image_result()

dict[str, t.Any] | None searx.engines.startpage._get_image_result ( result)
protected

Definition at line 373 of file startpage.py.

def _get_image_result(result) -> dict[str, t.Any] | None:
    url = result.get('altClickUrl')
    if not url:
        return None

    thumbnailUrl = None
    if result.get('thumbnailUrl'):
        thumbnailUrl = base_url + result['thumbnailUrl']

    resolution = None
    if result.get('width') and result.get('height'):
        resolution = f"{result['width']}x{result['height']}"

    filesize = None
    if result.get('filesize'):
        size_str = ''.join(filter(str.isdigit, result['filesize']))
        filesize = humanize_bytes(int(size_str))

    return {
        'template': 'images.html',
        'url': url,
        'title': html_to_text(result['title']),
        'content': '',
        'img_src': result.get('rawImageUrl'),
        'thumbnail_src': thumbnailUrl,
        'resolution': resolution,
        'img_format': result.get('format'),
        'filesize': filesize,
    }

Referenced by response().


◆ _get_news_result()

searx.engines.startpage._get_news_result ( result)
protected

Definition at line 351 of file startpage.py.

def _get_news_result(result):
    title = remove_pua_from_str(html_to_text(result['title']))
    content = remove_pua_from_str(html_to_text(result.get('description')))

    publishedDate = None
    if result.get('date'):
        publishedDate = datetime.fromtimestamp(result['date'] / 1000)

    thumbnailUrl = None
    if result.get('thumbnailUrl'):
        thumbnailUrl = base_url + result['thumbnailUrl']

    return {
        'url': result['clickUrl'],
        'title': title,
        'content': content,
        'publishedDate': publishedDate,
        'thumbnail': thumbnailUrl,
    }

Referenced by response().


◆ _get_web_result()

searx.engines.startpage._get_web_result ( result)
protected

Definition at line 339 of file startpage.py.

def _get_web_result(result):
    content = html_to_text(result.get('description'))
    content, publishedDate = _parse_published_date(content)

    return {
        'url': result['clickUrl'],
        'title': html_to_text(result['title']),
        'content': content,
        'publishedDate': publishedDate,
    }

References _parse_published_date().

Referenced by response().


◆ _parse_published_date()

tuple[str, datetime | None] searx.engines.startpage._parse_published_date ( str content)
protected

Definition at line 310 of file startpage.py.

def _parse_published_date(content: str) -> tuple[str, datetime | None]:
    published_date = None

    # check if search result starts with something like: "2 Sep 2014 ... "
    if re.match(r"^([1-9]|[1-2][0-9]|3[0-1]) [A-Z][a-z]{2} [0-9]{4} \.\.\. ", content):
        date_pos = content.find('...') + 4
        date_string = content[0 : date_pos - 5]
        # fix content string
        content = content[date_pos:]

        try:
            published_date = dateutil.parser.parse(date_string, dayfirst=True)
        except ValueError:
            pass

    # check if search result starts with something like: "5 days ago ... "
    elif re.match(r"^[0-9]+ days? ago \.\.\. ", content):
        date_pos = content.find('...') + 4
        date_string = content[0 : date_pos - 5]

        # calculate datetime
        published_date = datetime.now() - timedelta(days=int(re.match(r'\d+', date_string).group()))  # type: ignore

        # fix content string
        content = content[date_pos:]

    return content, published_date
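
The second branch ("N days ago ... ") can be tried in isolation; ``parse_relative_date`` is a hypothetical name for a sketch of that branch:

```python
import re
from datetime import datetime, timedelta

def parse_relative_date(content):
    # Sketch of the '5 days ago ... ' branch of _parse_published_date():
    # strip the date prefix from the content and compute the datetime.
    m = re.match(r"^(\d+) days? ago \.\.\. ", content)
    if not m:
        return content, None
    published = datetime.now() - timedelta(days=int(m.group(1)))
    return content[m.end():], published
```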

Referenced by _get_web_result().


◆ fetch_traits()

searx.engines.startpage.fetch_traits ( EngineTraits engine_traits)
Fetch :ref:`languages <startpage languages>` and :ref:`regions <startpage
regions>` from Startpage.

Definition at line 425 of file startpage.py.

def fetch_traits(engine_traits: EngineTraits):
    """Fetch :ref:`languages <startpage languages>` and :ref:`regions <startpage
    regions>` from Startpage."""
    # pylint: disable=too-many-branches

    headers = {
        'User-Agent': gen_useragent(),
        'Accept-Language': "en-US,en;q=0.5",  # bing needs to set the English language
    }
    resp = get('https://www.startpage.com/do/settings', headers=headers)

    if not resp.ok:  # type: ignore
        print("ERROR: response from Startpage is not OK.")

    dom = lxml.html.fromstring(resp.text)  # type: ignore

    # regions

    sp_region_names = []
    for option in dom.xpath('//form[@name="settings"]//select[@name="search_results_region"]/option'):
        sp_region_names.append(option.get('value'))

    for eng_tag in sp_region_names:
        if eng_tag == 'all':
            continue
        babel_region_tag = {'no_NO': 'nb_NO'}.get(eng_tag, eng_tag)  # norway

        if '-' in babel_region_tag:
            l, r = babel_region_tag.split('-')
            r = r.split('_')[-1]
            sxng_tag = region_tag(babel.Locale.parse(l + '_' + r, sep='_'))

        else:
            try:
                sxng_tag = region_tag(babel.Locale.parse(babel_region_tag, sep='_'))

            except babel.UnknownLocaleError:
                print("ERROR: can't determine babel locale of startpage's locale %s" % eng_tag)
                continue

        conflict = engine_traits.regions.get(sxng_tag)
        if conflict:
            if conflict != eng_tag:
                print("CONFLICT: babel %s --> %s, %s" % (sxng_tag, conflict, eng_tag))
            continue
        engine_traits.regions[sxng_tag] = eng_tag

    # languages

    catalog_engine2code = {name.lower(): lang_code for lang_code, name in babel.Locale('en').languages.items()}

    # get the native name of every language known by babel

    for lang_code in filter(lambda lang_code: lang_code.find('_') == -1, babel.localedata.locale_identifiers()):
        native_name = babel.Locale(lang_code).get_language_name()
        if not native_name:
            print(f"ERROR: language name of startpage's language {lang_code} is unknown by babel")
            continue
        native_name = native_name.lower()
        # add native name exactly as it is
        catalog_engine2code[native_name] = lang_code

        # add "normalized" language name (i.e. français becomes francais and español becomes espanol)
        unaccented_name = ''.join(filter(lambda c: not combining(c), normalize('NFKD', native_name)))
        if len(unaccented_name) == len(unaccented_name.encode()):
            # add only if result is ascii (otherwise "normalization" didn't work)
            catalog_engine2code[unaccented_name] = lang_code

    # values that can't be determined by babel's languages names

    catalog_engine2code.update(
        {
            # traditional chinese used in ..
            'fantizhengwen': 'zh_Hant',
            # Korean alphabet
            'hangul': 'ko',
            # Malayalam is one of 22 scheduled languages of India.
            'malayam': 'ml',
            'norsk': 'nb',
            'sinhalese': 'si',
        }
    )

    skip_eng_tags = {
        'english_uk',  # SearXNG lang 'en' already maps to 'english'
    }

    for option in dom.xpath('//form[@name="settings"]//select[@name="language"]/option'):

        eng_tag = option.get('value')
        if eng_tag in skip_eng_tags:
            continue
        name = extract_text(option).lower()  # type: ignore

        sxng_tag = catalog_engine2code.get(eng_tag)
        if sxng_tag is None:
            sxng_tag = catalog_engine2code[name]

        conflict = engine_traits.languages.get(sxng_tag)
        if conflict:
            if conflict != eng_tag:
                print("CONFLICT: babel %s --> %s, %s" % (sxng_tag, conflict, eng_tag))
            continue
        engine_traits.languages[sxng_tag] = eng_tag
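
The accent-stripping step above can be isolated as a short helper; ``unaccent`` is a hypothetical name for a sketch of that step:

```python
from unicodedata import combining, normalize

def unaccent(name):
    # The normalization step in fetch_traits(): decompose (NFKD), drop
    # combining marks, and keep the result only if it is pure ASCII.
    stripped = ''.join(c for c in normalize('NFKD', name) if not combining(c))
    return stripped if len(stripped) == len(stripped.encode()) else None
```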

◆ get_sc_code()

searx.engines.startpage.get_sc_code (searxng_locale, params)
Get an actual ``sc`` argument from Startpage's search form (HTML page).

Startpage puts a ``sc`` argument on every HTML :py:obj:`search form
<search_form_xpath>`.  Without this argument Startpage considers the request
to be from a bot.  We do not know what is encoded in the value of the ``sc``
argument, but it seems to be a kind of *timestamp*.

Startpage's search form generates a new sc-code on each request.  This
function scrapes a new sc-code from Startpage's home page every
:py:obj:`sc_code_cache_sec` seconds.

Definition at line 173 of file startpage.py.

def get_sc_code(searxng_locale, params):
    """Get an actual ``sc`` argument from Startpage's search form (HTML page).

    Startpage puts a ``sc`` argument on every HTML :py:obj:`search form
    <search_form_xpath>`.  Without this argument Startpage considers the request
    to be from a bot.  We do not know what is encoded in the value of the ``sc``
    argument, but it seems to be a kind of *timestamp*.

    Startpage's search form generates a new sc-code on each request.  This
    function scrapes a new sc-code from Startpage's home page every
    :py:obj:`sc_code_cache_sec` seconds."""

    sc_code = CACHE.get("SC_CODE")

    if sc_code:
        logger.debug("get_sc_code: using cached value: %s", sc_code)
        return sc_code

    headers = {**params['headers']}

    # add Accept-Language header
    if searxng_locale == 'all':
        searxng_locale = 'en-US'
    locale = babel.Locale.parse(searxng_locale, sep='-')

    if send_accept_language_header:
        ac_lang = locale.language
        if locale.territory:
            ac_lang = "%s-%s,%s;q=0.9,*;q=0.5" % (
                locale.language,
                locale.territory,
                locale.language,
            )
        headers['Accept-Language'] = ac_lang

    get_sc_url = base_url + '/'
    logger.debug("get_sc_code: querying new sc timestamp @ %s", get_sc_url)
    logger.debug("get_sc_code: request headers: %s", headers)
    resp = get(get_sc_url, headers=headers)

    # ?? x = network.get('https://www.startpage.com/sp/cdn/images/filter-chevron.svg', headers=headers)
    # ?? https://www.startpage.com/sp/cdn/images/filter-chevron.svg
    # ?? ping-back URL: https://www.startpage.com/sp/pb?sc=TLsB0oITjZ8F21

    if str(resp.url).startswith('https://www.startpage.com/sp/captcha'):  # type: ignore
        raise SearxEngineCaptchaException(
            message="get_sc_code: got redirected to https://www.startpage.com/sp/captcha",
        )

    dom = lxml.html.fromstring(resp.text)  # type: ignore

    try:
        sc_code = eval_xpath(dom, search_form_xpath + '//input[@name="sc"]/@value')[0]
    except IndexError as exc:
        logger.debug("suspend startpage API --> https://github.com/searxng/searxng/pull/695")
        raise SearxEngineCaptchaException(
            message="get_sc_code: [PR-695] querying new sc timestamp failed! (%s)" % resp.url,  # type: ignore
        ) from exc

    sc_code = str(sc_code)
    logger.debug("get_sc_code: new value is: %s", sc_code)
    CACHE.set(key="SC_CODE", value=sc_code, expire=sc_code_cache_sec)
    return sc_code
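
The caching pattern :py:obj:`get_sc_code` relies on can be sketched without SearXNG's ``EngineCache``; ``TTLCache`` below is a minimal in-process stand-in (the real cache is shared by all three Startpage engines):

```python
import time

class TTLCache:
    # Minimal stand-in for the EngineCache used by get_sc_code(): values
    # expire after a per-entry time-to-live, like sc_code_cache_sec.
    def __init__(self):
        self._store = {}

    def set(self, key, value, expire):
        # expire: lifetime in seconds
        self._store[key] = (value, time.monotonic() + expire)

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, deadline = item
        if time.monotonic() > deadline:
            del self._store[key]
            return default
        return value
```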

Referenced by request().


◆ init()

searx.engines.startpage.init ( _)

Definition at line 161 of file startpage.py.

def init(_):
    global CACHE  # pylint: disable=global-statement

    # hint: all three startpage engines (WEB, Images & News) can/should use the
    # same sc_code ..
    CACHE = EngineCache("startpage")  # type:ignore

◆ request()

searx.engines.startpage.request (query, params)
Assemble a Startpage request.

To avoid CAPTCHAs we need to send a well-formed HTTP POST request with a
cookie. We need to form a request that is identical to the request built by
Startpage's search form:

- in the cookie the **region** is selected
- in the HTTP POST data the **language** is selected

Additionally, the arguments from Startpage's search form need to be set in
the HTTP POST data / compare ``<input>`` elements: :py:obj:`search_form_xpath`.

Definition at line 238 of file startpage.py.

def request(query, params):
    """Assemble a Startpage request.

    To avoid CAPTCHAs we need to send a well-formed HTTP POST request with a
    cookie. We need to form a request that is identical to the request built by
    Startpage's search form:

    - in the cookie the **region** is selected
    - in the HTTP POST data the **language** is selected

    Additionally, the arguments from Startpage's search form need to be set in
    the HTTP POST data / compare ``<input>`` elements: :py:obj:`search_form_xpath`.
    """
    engine_region = traits.get_region(params['searxng_locale'], 'en-US')
    engine_language = traits.get_language(params['searxng_locale'], 'en')

    params['headers']['Origin'] = base_url
    params['headers']['Referer'] = base_url + '/'

    # Build form data
    args = {
        'query': query,
        'cat': startpage_categ,
        't': 'device',
        'sc': get_sc_code(params['searxng_locale'], params),  # hint: this func needs HTTP headers
        'with_date': time_range_dict.get(params['time_range'], ''),
        'abp': '1',
        'abd': '1',
        'abe': '1',
    }

    if engine_language:
        args['language'] = engine_language
        args['lui'] = engine_language

    if params['pageno'] > 1:
        args['page'] = params['pageno']
        args['segment'] = 'startpage.udog'

    # Build cookie
    lang_homepage = 'en'
    cookie = OrderedDict()
    cookie['date_time'] = 'world'
    cookie['disable_family_filter'] = safesearch_dict[params['safesearch']]
    cookie['disable_open_in_new_window'] = '0'
    cookie['enable_post_method'] = '1'  # hint: POST
    cookie['enable_proxy_safety_suggest'] = '1'
    cookie['enable_stay_control'] = '1'
    cookie['instant_answers'] = '1'
    cookie['lang_homepage'] = 's/device/%s/' % lang_homepage
    cookie['num_of_results'] = '10'
    cookie['suggestions'] = '1'
    cookie['wt_unit'] = 'celsius'

    if engine_language:
        cookie['language'] = engine_language
        cookie['language_ui'] = engine_language

    if engine_region:
        cookie['search_results_region'] = engine_region

    params['cookies']['preferences'] = 'N1N'.join(["%sEEE%s" % x for x in cookie.items()])
    logger.debug('cookie preferences: %s', params['cookies']['preferences'])

    logger.debug("data: %s", args)
    params['data'] = args
    params['method'] = 'POST'
    params['url'] = search_url

    return params
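
The ``preferences`` cookie serialization at the end of :py:obj:`request` can be exercised on its own; ``encode_preferences`` is a hypothetical helper wrapping the same expression:

```python
from collections import OrderedDict

def encode_preferences(cookie):
    # Startpage's 'preferences' cookie: 'EEE' separates a key from its
    # value, 'N1N' separates the key/value pairs from each other.
    return 'N1N'.join("%sEEE%s" % kv for kv in cookie.items())

cookie = OrderedDict()
cookie['date_time'] = 'world'
cookie['num_of_results'] = '10'
encode_preferences(cookie)  # 'date_timeEEEworldN1Nnum_of_resultsEEE10'
```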

References get_sc_code().


◆ response()

searx.engines.startpage.response ( resp)

Definition at line 404 of file startpage.py.

def response(resp):
    categ = startpage_categ.capitalize()
    results_raw = '{' + extr(resp.text, f"React.createElement(UIStartpage.AppSerp{categ}, {{", '}})') + '}}'
    results_json = loads(results_raw)
    results_obj = results_json.get('render', {}).get('presenter', {}).get('regions', {})

    results = []
    for results_categ in results_obj.get('mainline', []):
        for item in results_categ.get('results', []):
            if results_categ['display_type'] == 'web-google':
                results.append(_get_web_result(item))
            elif results_categ['display_type'] == 'news-bing':
                results.append(_get_news_result(item))
            elif 'images' in results_categ['display_type']:
                item = _get_image_result(item)
                if item:
                    results.append(item)

    return results
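
``searx.utils.extr`` cuts the serialized React props out of the HTML; the stand-in below only assumes the "substring between two markers" behavior and may differ from the real helper:

```python
import json

def extr(text, start, end):
    # Minimal stand-in for searx.utils.extr: the substring between the
    # first occurrence of `start` and the next occurrence of `end`.
    i = text.find(start)
    if i < 0:
        return ''
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j >= 0 else ''

# Toy payload shaped like the React.createElement call response() parses.
html = 'var x = React.createElement(UIStartpage.AppSerpWeb, {"render": {"mainline": []}})'
raw = '{' + extr(html, 'React.createElement(UIStartpage.AppSerpWeb, {', '}})') + '}}'
data = json.loads(raw)  # {'render': {'mainline': []}}
```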

References _get_image_result(), _get_news_result(), and _get_web_result().


Variable Documentation

◆ about

dict searx.engines.startpage.about
Initial value:
{
    "website": 'https://startpage.com',
    "wikidata_id": 'Q2333295',
    "official_api_documentation": None,
    "use_official_api": False,
    "require_api_key": False,
    "results": 'HTML',
}

Definition at line 102 of file startpage.py.

◆ base_url

str searx.engines.startpage.base_url = 'https://www.startpage.com'

Definition at line 134 of file startpage.py.

◆ categories

list searx.engines.startpage.categories = ['general', 'web']

Definition at line 122 of file startpage.py.

◆ max_page

int searx.engines.startpage.max_page = 18

Definition at line 124 of file startpage.py.

◆ paging

bool searx.engines.startpage.paging = True

Definition at line 123 of file startpage.py.

◆ safesearch

bool searx.engines.startpage.safesearch = True

Definition at line 128 of file startpage.py.

◆ safesearch_dict

dict searx.engines.startpage.safesearch_dict = {0: '0', 1: '1', 2: '1'}

Definition at line 131 of file startpage.py.

◆ sc_code_cache_sec

int searx.engines.startpage.sc_code_cache_sec = 3600

Definition at line 169 of file startpage.py.

◆ search_form_xpath

str searx.engines.startpage.search_form_xpath = '//form[@id="search"]'

Definition at line 140 of file startpage.py.

◆ search_url

str searx.engines.startpage.search_url = base_url + '/sp/search'

Definition at line 135 of file startpage.py.

◆ send_accept_language_header

bool searx.engines.startpage.send_accept_language_header = True

Definition at line 115 of file startpage.py.

◆ startpage_categ

str searx.engines.startpage.startpage_categ = 'web'

Definition at line 111 of file startpage.py.

◆ time_range_dict

dict searx.engines.startpage.time_range_dict = {'day': 'd', 'week': 'w', 'month': 'm', 'year': 'y'}

Definition at line 130 of file startpage.py.

◆ time_range_support

bool searx.engines.startpage.time_range_support = True

Definition at line 127 of file startpage.py.