.oO SearXNG Developer Documentation Oo.
searx.engines.startpage Namespace Reference

Functions

 get_sc_code (searxng_locale, params)
 
 request (query, params)
 
tuple[str, datetime|None] _parse_published_date (str content)
 
 _get_web_result (result)
 
 _get_news_result (result)
 
dict[str, Any]|None _get_image_result (result)
 
 response (resp)
 
 fetch_traits (EngineTraits engine_traits)
 

Variables

logging.Logger logger
 
dict about
 
str startpage_categ = 'web'
 
bool send_accept_language_header = True
 
list categories = ['general', 'web']
 
bool paging = True
 
int max_page = 18
 
bool time_range_support = True
 
bool safesearch = True
 
dict time_range_dict = {'day': 'd', 'week': 'w', 'month': 'm', 'year': 'y'}
 
dict safesearch_dict = {0: '0', 1: '1', 2: '1'}
 
str base_url = 'https://www.startpage.com'
 
str search_url = base_url + '/sp/search'
 
str search_form_xpath = '//form[@id="search"]'
 
int sc_code_ts = 0
 
str sc_code = ''
 
int sc_code_cache_sec = 30
 

Detailed Description

Startpage's language & region selectors are a mess ..

.. _startpage regions:

Startpage regions
=================

In the list of regions there are tags we need to map to common region tags::

  pt-BR_BR --> pt_BR
  zh-CN_CN --> zh_Hans_CN
  zh-TW_TW --> zh_Hant_TW
  zh-TW_HK --> zh_Hant_HK
  en-GB_GB --> en_GB

and there is at least one tag with a three letter language tag (ISO 639-2)::

  fil_PH --> fil_PH

The locale code ``no_NO`` from Startpage does not exist and is mapped to
``nb_NO``::

    babel.core.UnknownLocaleError: unknown locale 'no_NO'

For reference see the language-subtag-registry at IANA; ``no`` is the
macrolanguage [1]_ and W3C recommends the subtag over the macrolanguage [2]_.

.. [1] `iana: language-subtag-registry
   <https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry>`_ ::

      type: language
      Subtag: nb
      Description: Norwegian Bokmål
      Added: 2005-10-16
      Suppress-Script: Latn
      Macrolanguage: no

.. [2]
   Use macrolanguages with care.  Some language subtags have a Scope field set to
   macrolanguage, i.e. this primary language subtag encompasses a number of more
   specific primary language subtags in the registry.  ...  As we recommended for
   the collection subtags mentioned above, in most cases you should try to use
   the more specific subtags ... `W3: The primary language subtag
   <https://www.w3.org/International/questions/qa-choosing-language-tags#langsubtag>`_
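The string part of the tag mapping above can be sketched as a plain
normalization (a minimal standalone sketch; ``normalize_region_tag`` is a
hypothetical helper name, and unlike the engine it does not use babel, so
script subtags such as ``Hans``/``Hant`` for the Chinese tags are not inferred
here):

```python
def normalize_region_tag(eng_tag: str) -> str:
    """Normalize a Startpage region tag (sketch).

    Startpage tags like 'pt-BR_BR' repeat the territory; keep the part
    before the '-' as language and the trailing territory.  'no_NO' is
    unknown to babel and is therefore mapped to 'nb_NO'.
    """
    eng_tag = {'no_NO': 'nb_NO'}.get(eng_tag, eng_tag)
    if '-' in eng_tag:
        lang, rest = eng_tag.split('-')
        territory = rest.split('_')[-1]
        return f"{lang}_{territory}"
    return eng_tag

print(normalize_region_tag('pt-BR_BR'))  # pt_BR
print(normalize_region_tag('en-GB_GB'))  # en_GB
print(normalize_region_tag('no_NO'))     # nb_NO
print(normalize_region_tag('fil_PH'))    # fil_PH
```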

.. _startpage languages:

Startpage languages
===================

:py:obj:`send_accept_language_header`:
  The displayed names on Startpage's settings page depend on the location of
  the IP when the ``Accept-Language`` HTTP header is unset.  In :py:obj:`fetch_traits`
  we use::

    'Accept-Language': "en-US,en;q=0.5",
    ..

  to get uniform names independent from the IP.

.. _startpage categories:

Startpage categories
====================

Startpage's category (for Web-search, News, Videos, ..) is set by
:py:obj:`startpage_categ` in settings.yml::

  - name: startpage
    engine: startpage
    startpage_categ: web
    ...

.. hint::

  Supported categories are ``web``, ``news`` and ``images``.

Function Documentation

◆ _get_image_result()

dict[str, Any] | None searx.engines.startpage._get_image_result ( result)
protected

Definition at line 375 of file startpage.py.

def _get_image_result(result) -> dict[str, Any] | None:
    url = result.get('altClickUrl')
    if not url:
        return None

    thumbnailUrl = None
    if result.get('thumbnailUrl'):
        thumbnailUrl = base_url + result['thumbnailUrl']

    resolution = None
    if result.get('width') and result.get('height'):
        resolution = f"{result['width']}x{result['height']}"

    filesize = None
    if result.get('filesize'):
        size_str = ''.join(filter(str.isdigit, result['filesize']))
        filesize = humanize_bytes(int(size_str))

    return {
        'template': 'images.html',
        'url': url,
        'title': html_to_text(result['title']),
        'content': '',
        'img_src': result.get('rawImageUrl'),
        'thumbnail_src': thumbnailUrl,
        'resolution': resolution,
        'img_format': result.get('format'),
        'filesize': filesize,
    }

Referenced by response().


◆ _get_news_result()

searx.engines.startpage._get_news_result ( result)
protected

Definition at line 353 of file startpage.py.

def _get_news_result(result):

    title = remove_pua_from_str(html_to_text(result['title']))
    content = remove_pua_from_str(html_to_text(result.get('description')))

    publishedDate = None
    if result.get('date'):
        publishedDate = datetime.fromtimestamp(result['date'] / 1000)

    thumbnailUrl = None
    if result.get('thumbnailUrl'):
        thumbnailUrl = base_url + result['thumbnailUrl']

    return {
        'url': result['clickUrl'],
        'title': title,
        'content': content,
        'publishedDate': publishedDate,
        'thumbnail': thumbnailUrl,
    }

Referenced by response().


◆ _get_web_result()

searx.engines.startpage._get_web_result ( result)
protected

Definition at line 341 of file startpage.py.

def _get_web_result(result):
    content = html_to_text(result.get('description'))
    content, publishedDate = _parse_published_date(content)

    return {
        'url': result['clickUrl'],
        'title': html_to_text(result['title']),
        'content': content,
        'publishedDate': publishedDate,
    }

References _parse_published_date().

Referenced by response().


◆ _parse_published_date()

tuple[str, datetime | None] searx.engines.startpage._parse_published_date ( str content)
protected

Definition at line 312 of file startpage.py.

def _parse_published_date(content: str) -> tuple[str, datetime | None]:
    published_date = None

    # check if search result starts with something like: "2 Sep 2014 ... "
    if re.match(r"^([1-9]|[1-2][0-9]|3[0-1]) [A-Z][a-z]{2} [0-9]{4} \.\.\. ", content):
        date_pos = content.find('...') + 4
        date_string = content[0 : date_pos - 5]
        # fix content string
        content = content[date_pos:]

        try:
            published_date = dateutil.parser.parse(date_string, dayfirst=True)
        except ValueError:
            pass

    # check if search result starts with something like: "5 days ago ... "
    elif re.match(r"^[0-9]+ days? ago \.\.\. ", content):
        date_pos = content.find('...') + 4
        date_string = content[0 : date_pos - 5]

        # calculate datetime
        published_date = datetime.now() - timedelta(days=int(re.match(r'\d+', date_string).group()))  # type: ignore

        # fix content string
        content = content[date_pos:]

    return content, published_date

Referenced by _get_web_result().
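The date-prefix handling can be tried out with a standalone sketch (same
regexes and slicing as `_parse_published_date`, but using
`datetime.strptime` instead of `dateutil.parser.parse` to stay
dependency-free):

```python
import re
from datetime import datetime, timedelta

def parse_published_date(content):
    """Standalone sketch of _parse_published_date."""
    published_date = None
    # absolute prefix, e.g. "2 Sep 2014 ... "
    if re.match(r"^([1-9]|[1-2][0-9]|3[0-1]) [A-Z][a-z]{2} [0-9]{4} \.\.\. ", content):
        date_pos = content.find('...') + 4
        date_string = content[0 : date_pos - 5]
        content = content[date_pos:]
        try:
            published_date = datetime.strptime(date_string, "%d %b %Y")
        except ValueError:
            pass
    # relative prefix, e.g. "5 days ago ... "
    elif re.match(r"^[0-9]+ days? ago \.\.\. ", content):
        date_pos = content.find('...') + 4
        date_string = content[0 : date_pos - 5]
        published_date = datetime.now() - timedelta(days=int(re.match(r'\d+', date_string).group()))
        content = content[date_pos:]
    return content, published_date

content, date = parse_published_date("2 Sep 2014 ... Lorem ipsum dolor sit amet.")
print(content)  # Lorem ipsum dolor sit amet.
print(date)     # 2014-09-02 00:00:00
```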


◆ fetch_traits()

searx.engines.startpage.fetch_traits ( EngineTraits engine_traits)
Fetch :ref:`languages <startpage languages>` and :ref:`regions <startpage
regions>` from Startpage.

Definition at line 427 of file startpage.py.

def fetch_traits(engine_traits: EngineTraits):
    """Fetch :ref:`languages <startpage languages>` and :ref:`regions <startpage
    regions>` from Startpage."""
    # pylint: disable=too-many-branches

    headers = {
        'User-Agent': gen_useragent(),
        'Accept-Language': "en-US,en;q=0.5",  # bing needs to set the English language
    }
    resp = get('https://www.startpage.com/do/settings', headers=headers)

    if not resp.ok:  # type: ignore
        print("ERROR: response from Startpage is not OK.")

    dom = lxml.html.fromstring(resp.text)  # type: ignore

    # regions

    sp_region_names = []
    for option in dom.xpath('//form[@name="settings"]//select[@name="search_results_region"]/option'):
        sp_region_names.append(option.get('value'))

    for eng_tag in sp_region_names:
        if eng_tag == 'all':
            continue
        babel_region_tag = {'no_NO': 'nb_NO'}.get(eng_tag, eng_tag)  # norway

        if '-' in babel_region_tag:
            l, r = babel_region_tag.split('-')
            r = r.split('_')[-1]
            sxng_tag = region_tag(babel.Locale.parse(l + '_' + r, sep='_'))

        else:
            try:
                sxng_tag = region_tag(babel.Locale.parse(babel_region_tag, sep='_'))

            except babel.UnknownLocaleError:
                print("ERROR: can't determine babel locale of startpage's locale %s" % eng_tag)
                continue

        conflict = engine_traits.regions.get(sxng_tag)
        if conflict:
            if conflict != eng_tag:
                print("CONFLICT: babel %s --> %s, %s" % (sxng_tag, conflict, eng_tag))
            continue
        engine_traits.regions[sxng_tag] = eng_tag

    # languages

    catalog_engine2code = {name.lower(): lang_code for lang_code, name in babel.Locale('en').languages.items()}

    # get the native name of every language known by babel

    for lang_code in filter(lambda lang_code: lang_code.find('_') == -1, babel.localedata.locale_identifiers()):
        native_name = babel.Locale(lang_code).get_language_name()
        if not native_name:
            print(f"ERROR: language name of startpage's language {lang_code} is unknown by babel")
            continue
        native_name = native_name.lower()
        # add native name exactly as it is
        catalog_engine2code[native_name] = lang_code

        # add "normalized" language name (i.e. français becomes francais and español becomes espanol)
        unaccented_name = ''.join(filter(lambda c: not combining(c), normalize('NFKD', native_name)))
        if len(unaccented_name) == len(unaccented_name.encode()):
            # add only if result is ascii (otherwise "normalization" didn't work)
            catalog_engine2code[unaccented_name] = lang_code

    # values that can't be determined by babel's languages names

    catalog_engine2code.update(
        {
            # traditional chinese used in ..
            'fantizhengwen': 'zh_Hant',
            # Korean alphabet
            'hangul': 'ko',
            # Malayalam is one of 22 scheduled languages of India.
            'malayam': 'ml',
            'norsk': 'nb',
            'sinhalese': 'si',
        }
    )

    skip_eng_tags = {
        'english_uk',  # SearXNG lang 'en' already maps to 'english'
    }

    for option in dom.xpath('//form[@name="settings"]//select[@name="language"]/option'):

        eng_tag = option.get('value')
        if eng_tag in skip_eng_tags:
            continue
        name = extract_text(option).lower()  # type: ignore

        sxng_tag = catalog_engine2code.get(eng_tag)
        if sxng_tag is None:
            sxng_tag = catalog_engine2code[name]

        conflict = engine_traits.languages.get(sxng_tag)
        if conflict:
            if conflict != eng_tag:
                print("CONFLICT: babel %s --> %s, %s" % (sxng_tag, conflict, eng_tag))
            continue
        engine_traits.languages[sxng_tag] = eng_tag
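The language-name "normalization" used in fetch_traits (français becomes
francais, español becomes espanol) works by stripping Unicode combining marks
after NFKD decomposition; a minimal standalone sketch:

```python
from unicodedata import combining, normalize

def unaccent(native_name: str) -> str:
    """Strip combining marks after NFKD decomposition (sketch of the
    normalization step in fetch_traits)."""
    return ''.join(filter(lambda c: not combining(c), normalize('NFKD', native_name)))

for name in ('français', 'español'):
    unaccented = unaccent(name)
    # only usable as a catalog key if the result is pure ASCII
    if len(unaccented) == len(unaccented.encode()):
        print(name, '->', unaccented)
# français -> francais
# español -> espanol
```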

◆ get_sc_code()

searx.engines.startpage.get_sc_code ( searxng_locale, params )
Get an actual ``sc`` argument from Startpage's search form (HTML page).

Startpage puts a ``sc`` argument on every HTML :py:obj:`search form
<search_form_xpath>`.  Without this argument Startpage considers the request
to be from a bot.  We do not know what is encoded in the value of the ``sc``
argument, but it seems to be a kind of *time-stamp*.

Startpage's search form generates a new sc-code on each request.  This
function scrapes a new sc-code from Startpage's home page every
:py:obj:`sc_code_cache_sec` seconds.

Definition at line 169 of file startpage.py.

def get_sc_code(searxng_locale, params):
    """Get an actual ``sc`` argument from Startpage's search form (HTML page).

    Startpage puts a ``sc`` argument on every HTML :py:obj:`search form
    <search_form_xpath>`.  Without this argument Startpage considers the request
    to be from a bot.  We do not know what is encoded in the value of the ``sc``
    argument, but it seems to be a kind of *time-stamp*.

    Startpage's search form generates a new sc-code on each request.  This
    function scrapes a new sc-code from Startpage's home page every
    :py:obj:`sc_code_cache_sec` seconds.

    """

    global sc_code_ts, sc_code  # pylint: disable=global-statement

    if sc_code and (time() < (sc_code_ts + sc_code_cache_sec)):
        logger.debug("get_sc_code: reuse '%s'", sc_code)
        return sc_code

    headers = {**params['headers']}
    headers['Origin'] = base_url
    headers['Referer'] = base_url + '/'
    # headers['Connection'] = 'keep-alive'
    # headers['Accept-Encoding'] = 'gzip, deflate, br'
    # headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'
    # headers['User-Agent'] = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0'

    # add Accept-Language header
    if searxng_locale == 'all':
        searxng_locale = 'en-US'
    locale = babel.Locale.parse(searxng_locale, sep='-')

    if send_accept_language_header:
        ac_lang = locale.language
        if locale.territory:
            ac_lang = "%s-%s,%s;q=0.9,*;q=0.5" % (
                locale.language,
                locale.territory,
                locale.language,
            )
        headers['Accept-Language'] = ac_lang

    get_sc_url = base_url + '/?sc=%s' % (sc_code)
    logger.debug("query new sc time-stamp ... %s", get_sc_url)
    logger.debug("headers: %s", headers)
    resp = get(get_sc_url, headers=headers)

    # ?? x = network.get('https://www.startpage.com/sp/cdn/images/filter-chevron.svg', headers=headers)
    # ?? https://www.startpage.com/sp/cdn/images/filter-chevron.svg
    # ?? ping-back URL: https://www.startpage.com/sp/pb?sc=TLsB0oITjZ8F21

    if str(resp.url).startswith('https://www.startpage.com/sp/captcha'):  # type: ignore
        raise SearxEngineCaptchaException(
            message="get_sc_code: got redirected to https://www.startpage.com/sp/captcha",
        )

    dom = lxml.html.fromstring(resp.text)  # type: ignore

    try:
        sc_code = eval_xpath(dom, search_form_xpath + '//input[@name="sc"]/@value')[0]
    except IndexError as exc:
        logger.debug("suspend startpage API --> https://github.com/searxng/searxng/pull/695")
        raise SearxEngineCaptchaException(
            message="get_sc_code: [PR-695] query new sc time-stamp failed! (%s)" % resp.url,  # type: ignore
        ) from exc

    sc_code_ts = time()
    logger.debug("get_sc_code: new value is: %s", sc_code)
    return sc_code

Referenced by request().
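The ``Accept-Language`` value assembled in get_sc_code can be sketched without
babel by splitting the SearXNG locale tag directly (``accept_language`` is a
hypothetical helper name for this standalone sketch):

```python
def accept_language(searxng_locale: str) -> str:
    """Build the Accept-Language value used by get_sc_code (sketch)."""
    if searxng_locale == 'all':
        searxng_locale = 'en-US'
    parts = searxng_locale.split('-')
    language = parts[0]
    territory = parts[1] if len(parts) > 1 else None
    if territory:
        return "%s-%s,%s;q=0.9,*;q=0.5" % (language, territory, language)
    return language

print(accept_language('de-AT'))  # de-AT,de;q=0.9,*;q=0.5
print(accept_language('all'))    # en-US,en;q=0.9,*;q=0.5
```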


◆ request()

searx.engines.startpage.request ( query, params )
Assemble a Startpage request.

To avoid CAPTCHA we need to send a well-formed HTTP POST request with a
cookie.  We need to form a request that is identical to the request built by
Startpage's search form:

- in the cookie the **region** is selected
- in the HTTP POST data the **language** is selected

Additionally, the arguments from Startpage's search form need to be set in the
HTTP POST data / compare ``<input>`` elements: :py:obj:`search_form_xpath`.

Definition at line 241 of file startpage.py.

def request(query, params):
    """Assemble a Startpage request.

    To avoid CAPTCHA we need to send a well-formed HTTP POST request with a
    cookie.  We need to form a request that is identical to the request built by
    Startpage's search form:

    - in the cookie the **region** is selected
    - in the HTTP POST data the **language** is selected

    Additionally, the arguments from Startpage's search form need to be set in
    the HTTP POST data / compare ``<input>`` elements: :py:obj:`search_form_xpath`.
    """
    engine_region = traits.get_region(params['searxng_locale'], 'en-US')
    engine_language = traits.get_language(params['searxng_locale'], 'en')

    # build arguments
    args = {
        'query': query,
        'cat': startpage_categ,
        't': 'device',
        'sc': get_sc_code(params['searxng_locale'], params),  # hint: this func needs HTTP headers
        'with_date': time_range_dict.get(params['time_range'], ''),
    }

    if engine_language:
        args['language'] = engine_language
        args['lui'] = engine_language

    args['abp'] = '1'
    if params['pageno'] > 1:
        args['page'] = params['pageno']

    # build cookie
    lang_homepage = 'en'
    cookie = OrderedDict()
    cookie['date_time'] = 'world'
    cookie['disable_family_filter'] = safesearch_dict[params['safesearch']]
    cookie['disable_open_in_new_window'] = '0'
    cookie['enable_post_method'] = '1'  # hint: POST
    cookie['enable_proxy_safety_suggest'] = '1'
    cookie['enable_stay_control'] = '1'
    cookie['instant_answers'] = '1'
    cookie['lang_homepage'] = 's/device/%s/' % lang_homepage
    cookie['num_of_results'] = '10'
    cookie['suggestions'] = '1'
    cookie['wt_unit'] = 'celsius'

    if engine_language:
        cookie['language'] = engine_language
        cookie['language_ui'] = engine_language

    if engine_region:
        cookie['search_results_region'] = engine_region

    params['cookies']['preferences'] = 'N1N'.join(["%sEEE%s" % x for x in cookie.items()])
    logger.debug('cookie preferences: %s', params['cookies']['preferences'])

    # POST request
    logger.debug("data: %s", args)
    params['data'] = args
    params['method'] = 'POST'
    params['url'] = search_url
    params['headers']['Origin'] = base_url
    params['headers']['Referer'] = base_url + '/'
    # is the Accept header needed?
    # params['headers']['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'

    return params

References get_sc_code().
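The ``preferences`` cookie built in request() packs the settings into one
string with the literal separators ``EEE`` (between key and value) and ``N1N``
(between pairs); a minimal example with a shortened cookie dict:

```python
from collections import OrderedDict

cookie = OrderedDict()
cookie['disable_family_filter'] = '0'
cookie['enable_post_method'] = '1'
cookie['lang_homepage'] = 's/device/en/'

# same join expression as in request()
preferences = 'N1N'.join(["%sEEE%s" % x for x in cookie.items()])
print(preferences)
# disable_family_filterEEE0N1Nenable_post_methodEEE1N1Nlang_homepageEEEs/device/en/
```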


◆ response()

searx.engines.startpage.response ( resp)

Definition at line 406 of file startpage.py.

def response(resp):
    categ = startpage_categ.capitalize()
    results_raw = '{' + extr(resp.text, f"React.createElement(UIStartpage.AppSerp{categ}, {{", '}})') + '}}'
    results_json = loads(results_raw)
    results_obj = results_json.get('render', {}).get('presenter', {}).get('regions', {})

    results = []
    for results_categ in results_obj.get('mainline', []):
        for item in results_categ.get('results', []):
            if results_categ['display_type'] == 'web-google':
                results.append(_get_web_result(item))
            elif results_categ['display_type'] == 'news-bing':
                results.append(_get_news_result(item))
            elif 'images' in results_categ['display_type']:
                item = _get_image_result(item)
                if item:
                    results.append(item)

    return results

References _get_image_result(), _get_news_result(), and _get_web_result().
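response() recovers the result JSON by cutting it out of a
``React.createElement(...)`` call in the HTML.  The brace bookkeeping can be
illustrated with a stand-in for ``searx.utils.extr`` (the sample payload below
is made up):

```python
import json

def extr(text, begin, end):
    """Minimal stand-in for searx.utils.extr: substring between the first
    occurrence of ``begin`` and the next occurrence of ``end``."""
    start = text.find(begin)
    if start < 0:
        return ''
    start += len(begin)
    stop = text.find(end, start)
    return text[start:stop] if stop >= 0 else ''

html = ('React.createElement(UIStartpage.AppSerpWeb, '
        '{"render": {"presenter": {"regions": {"mainline": []}}}})')
# the opening '{' is part of the begin marker and the '}})' end marker
# consumes two closing braces, so both get re-added around the slice:
raw = '{' + extr(html, 'React.createElement(UIStartpage.AppSerpWeb, {', '}})') + '}}'
data = json.loads(raw)
print(data['render']['presenter']['regions'])  # {'mainline': []}
```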


Variable Documentation

◆ about

dict searx.engines.startpage.about
Initial value:
about = {
    "website": 'https://startpage.com',
    "wikidata_id": 'Q2333295',
    "official_api_documentation": None,
    "use_official_api": False,
    "require_api_key": False,
    "results": 'HTML',
}

Definition at line 109 of file startpage.py.

◆ base_url

str searx.engines.startpage.base_url = 'https://www.startpage.com'

Definition at line 141 of file startpage.py.

◆ categories

list searx.engines.startpage.categories = ['general', 'web']

Definition at line 129 of file startpage.py.

◆ logger

logging.Logger searx.engines.startpage.logger

Definition at line 104 of file startpage.py.

◆ max_page

int searx.engines.startpage.max_page = 18

Definition at line 131 of file startpage.py.

◆ paging

bool searx.engines.startpage.paging = True

Definition at line 130 of file startpage.py.

◆ safesearch

bool searx.engines.startpage.safesearch = True

Definition at line 135 of file startpage.py.

◆ safesearch_dict

dict searx.engines.startpage.safesearch_dict = {0: '0', 1: '1', 2: '1'}

Definition at line 138 of file startpage.py.

◆ sc_code

str searx.engines.startpage.sc_code = ''

Definition at line 164 of file startpage.py.

◆ sc_code_cache_sec

int searx.engines.startpage.sc_code_cache_sec = 30

Definition at line 165 of file startpage.py.

◆ sc_code_ts

int searx.engines.startpage.sc_code_ts = 0

Definition at line 163 of file startpage.py.

◆ search_form_xpath

str searx.engines.startpage.search_form_xpath = '//form[@id="search"]'

Definition at line 147 of file startpage.py.

◆ search_url

str searx.engines.startpage.search_url = base_url + '/sp/search'

Definition at line 142 of file startpage.py.

◆ send_accept_language_header

bool searx.engines.startpage.send_accept_language_header = True

Definition at line 122 of file startpage.py.

◆ startpage_categ

str searx.engines.startpage.startpage_categ = 'web'

Definition at line 118 of file startpage.py.

◆ time_range_dict

dict searx.engines.startpage.time_range_dict = {'day': 'd', 'week': 'w', 'month': 'm', 'year': 'y'}

Definition at line 137 of file startpage.py.

◆ time_range_support

bool searx.engines.startpage.time_range_support = True

Definition at line 134 of file startpage.py.