.oO SearXNG Developer Documentation Oo.
searx.engines.google_news Namespace Reference

Functions

 request (query, params)
 
 response (resp)
 
 fetch_traits (EngineTraits engine_traits)
 

Variables

logging.Logger logger
 
dict about
 
list categories = ['news']
 
bool paging = False
 
bool time_range_support = False
 
bool safesearch = True
 
list ceid_list
 
list _skip_values
 
dict _ceid_locale_map = {'NO:no': 'nb-NO'}
 

Detailed Description

This is the implementation of the Google News engine.

Google News handles regions differently from Google WEB.

- the ``ceid`` argument has to be set (:py:obj:`ceid_list`)
- the hl_ argument has to be set correctly (and differently from Google WEB)
- the gl_ argument is mandatory

If any of these arguments is not set correctly, the request is redirected to
the CONSENT dialog::

  https://consent.google.com/m?continue=

The Google News API ignores some parameters from the common :ref:`google API`:

- num_ : the number of search results is ignored; there is no paging, and all
  results for a query term are in the first response.
- safe_ : is ignored; Google News results are always *SafeSearch*

.. _hl: https://developers.google.com/custom-search/docs/xml_results#hlsp
.. _gl: https://developers.google.com/custom-search/docs/xml_results#glsp
.. _num: https://developers.google.com/custom-search/docs/xml_results#numsp
.. _safe: https://developers.google.com/custom-search/docs/xml_results#safesp
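
The region handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's code: ``build_news_url`` is a hypothetical helper, and the ``hl``, ``gl`` and ``ceid`` values are example inputs. It shows why ``ceid`` is appended by hand: ``urllib.parse.urlencode`` would percent-encode the ``:`` character.

```python
from urllib.parse import urlencode


def build_news_url(query, hl, gl, ceid):
    """Hypothetical helper: build a Google News search URL.

    The ceid value is appended verbatim, because urlencode() would
    turn its ':' into '%3A'.
    """
    args = urlencode({'q': query, 'hl': hl, 'gl': gl})
    return 'https://news.google.com/search?' + args + '&ceid=' + ceid


if __name__ == '__main__':
    # Example values only; they are not taken from ceid_list.
    print(build_news_url('climate change', 'en-US', 'US', 'US:en'))
    # For comparison: what urlencode() does to a ceid value.
    print(urlencode({'ceid': 'US:en'}))
```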

Function Documentation

◆ fetch_traits()

searx.engines.google_news.fetch_traits (EngineTraits engine_traits)

Definition at line 282 of file google_news.py.

def fetch_traits(engine_traits: EngineTraits):
    _fetch_traits(engine_traits, add_domains=False)

    engine_traits.custom['ceid'] = {}

    for ceid in ceid_list:
        if ceid in _skip_values:
            continue

        region, lang = ceid.split(':')
        x = lang.split('-')
        if len(x) > 1:
            if x[1] not in ['Hant', 'Hans']:
                lang = x[0]

        sxng_locale = _ceid_locale_map.get(ceid, lang + '-' + region)
        try:
            locale = babel.Locale.parse(sxng_locale, sep='-')
        except babel.UnknownLocaleError:
            print("ERROR: %s -> %s is unknown by babel" % (ceid, sxng_locale))
            continue

        engine_traits.custom['ceid'][locales.region_tag(locale)] = ceid
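
The mapping from a ``ceid`` value to a locale tag can be tried in isolation with the sketch below. It is a simplified illustration: ``ceid_to_locale_tag`` is a hypothetical helper, the sample ``ceid`` values are assumptions (the contents of ``ceid_list`` are not shown on this page), and the babel validation and the ``locales.region_tag`` step of the listing above are left out.

```python
def ceid_to_locale_tag(ceid, locale_map=None):
    """Hypothetical helper: derive a 'lang-REGION' tag from a ceid value."""
    locale_map = locale_map or {}
    region, lang = ceid.split(':')
    parts = lang.split('-')
    # Drop suffixes such as '419', but keep the script subtags Hant / Hans.
    if len(parts) > 1 and parts[1] not in ('Hant', 'Hans'):
        lang = parts[0]
    # An explicit override (like _ceid_locale_map) wins over the derived tag.
    return locale_map.get(ceid, lang + '-' + region)


if __name__ == '__main__':
    overrides = {'NO:no': 'nb-NO'}
    for sample in ('US:en', 'TW:zh-Hant', 'US:es-419', 'NO:no'):
        print(sample, '->', ceid_to_locale_tag(sample, overrides))
```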

◆ request()

searx.engines.google_news.request (query, params)
Google-News search request

Definition at line 79 of file google_news.py.

def request(query, params):
    """Google-News search request"""

    sxng_locale = params.get('searxng_locale', 'en-US')
    ceid = locales.get_engine_locale(sxng_locale, traits.custom['ceid'], default='US:en')
    google_info = get_google_info(params, traits)
    google_info['subdomain'] = 'news.google.com'  # google news has only one domain

    ceid_region, ceid_lang = ceid.split(':')
    ceid_lang, ceid_suffix = (
        ceid_lang.split('-')
        + [
            None,
        ]
    )[:2]

    google_info['params']['hl'] = ceid_lang

    if ceid_suffix and ceid_suffix not in ['Hans', 'Hant']:

        if ceid_region.lower() == ceid_lang:
            google_info['params']['hl'] = ceid_lang + '-' + ceid_region
        else:
            google_info['params']['hl'] = ceid_lang + '-' + ceid_suffix

    elif ceid_region.lower() != ceid_lang:

        if ceid_region in ['AT', 'BE', 'CH', 'IL', 'SA', 'IN', 'BD', 'PT']:
            google_info['params']['hl'] = ceid_lang
        else:
            google_info['params']['hl'] = ceid_lang + '-' + ceid_region

    google_info['params']['lr'] = 'lang_' + ceid_lang.split('-')[0]
    google_info['params']['gl'] = ceid_region

    query_url = (
        'https://'
        + google_info['subdomain']
        + "/search?"
        + urlencode(
            {
                'q': query,
                **google_info['params'],
            }
        )
        # ceid includes a ':' character which must not be urlencoded
        + ('&ceid=%s' % ceid)
    )

    params['url'] = query_url
    params['cookies'] = google_info['cookies']
    params['headers'].update(google_info['headers'])
    return params
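
The branching that selects the ``hl`` value is easier to follow when it is separated from the request plumbing. The sketch below mirrors that branching in a standalone function. ``hl_from_ceid`` and ``BARE_LANG_REGIONS`` are hypothetical names introduced for illustration, and the sample ``ceid`` values are assumptions rather than entries confirmed to be in ``ceid_list``.

```python
# Regions for which the listing above sends the bare language code as hl.
BARE_LANG_REGIONS = ('AT', 'BE', 'CH', 'IL', 'SA', 'IN', 'BD', 'PT')


def hl_from_ceid(ceid):
    """Hypothetical helper: mirror the hl selection of request()."""
    region, lang = ceid.split(':')
    # Pad with None so a language without a suffix still unpacks into two names.
    lang, suffix = (lang.split('-') + [None])[:2]

    hl = lang
    if suffix and suffix not in ('Hans', 'Hant'):
        # A non-script suffix such as '419' is present.
        if region.lower() == lang:
            hl = lang + '-' + region
        else:
            hl = lang + '-' + suffix
    elif region.lower() != lang:
        # No suffix (or a script suffix): add the region unless it is exempt.
        if region not in BARE_LANG_REGIONS:
            hl = lang + '-' + region
    return hl


if __name__ == '__main__':
    for sample in ('US:en', 'DE:de', 'AT:de', 'TW:zh-Hant', 'US:es-419'):
        print(sample, '->', hl_from_ceid(sample))
```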

◆ response()

searx.engines.google_news.response (resp)
Get response from google's search request

Definition at line 134 of file google_news.py.

def response(resp):
    """Get response from google's search request"""
    results = []
    detect_google_sorry(resp)

    # convert the text to dom
    dom = html.fromstring(resp.text)

    for result in eval_xpath_list(dom, '//div[@class="xrnccd"]'):

        # The first <a> tag in the <article> contains the link to the article
        # The href attribute of the <a> tag is a google internal link, we have
        # to decode

        href = eval_xpath_getindex(result, './article/a/@href', 0)
        href = href.split('?')[0]
        href = href.split('/')[-1]
        href = base64.urlsafe_b64decode(href + '====')
        href = href[href.index(b'http') :].split(b'\xd2')[0]
        href = href.decode()

        title = extract_text(eval_xpath(result, './article/h3[1]'))

        # The pub_date is mostly a string like 'yesterday', not a real
        # timezone date or time. Therefore we can't use publishedDate.
        pub_date = extract_text(eval_xpath(result, './article//time'))
        pub_origin = extract_text(eval_xpath(result, './article//a[@data-n-tid]'))

        content = ' / '.join([x for x in [pub_origin, pub_date] if x])

        # The image URL is located in a preceding sibling <img> tag, e.g.:
        # "https://lh3.googleusercontent.com/DjhQh7DMszk.....z=-p-h100-w100"
        # These URL are long but not personalized (double checked via tor).

        thumbnail = extract_text(result.xpath('preceding-sibling::a/figure/img/@src'))

        results.append(
            {
                'url': href,
                'title': title,
                'content': content,
                'thumbnail': thumbnail,
            }
        )

    # return results
    return results
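
The href decoding step can be exercised on its own with the sketch below. ``decode_article_url`` is a hypothetical helper that repeats the four ``href`` transformations from the listing above. ``make_synthetic_href`` builds a made-up payload purely for demonstration; its byte layout is an assumption and is not Google's actual link format.

```python
import base64


def decode_article_url(href):
    """Hypothetical helper: extract the target URL from an encoded href."""
    # Drop the query string, then keep only the last path segment.
    token = href.split('?')[0].split('/')[-1]
    # Surplus '=' padding is tolerated by the decoder, so a token with
    # stripped padding still decodes.
    raw = base64.urlsafe_b64decode(token + '====')
    # Keep the bytes from the first 'http' up to the 0xd2 delimiter.
    return raw[raw.index(b'http'):].split(b'\xd2')[0].decode()


def make_synthetic_href(url):
    """Build a synthetic href for demonstration; NOT Google's real format."""
    payload = b'\x08\x13' + url.encode() + b'\xd2\x01\x00'
    token = base64.urlsafe_b64encode(payload).decode().rstrip('=')
    return './articles/' + token + '?hl=en-US&gl=US&ceid=US:en'


if __name__ == '__main__':
    href = make_synthetic_href('https://example.org/story')
    print(href)
    print(decode_article_url(href))
```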

Variable Documentation

◆ _ceid_locale_map

dict searx.engines.google_news._ceid_locale_map = {'NO:no': 'nb-NO'}
protected

Definition at line 279 of file google_news.py.

◆ _skip_values

list searx.engines.google_news._skip_values
protected
Initial value:
= [
    'ET:en',  # english (ethiopia)
    'ID:en',  # english (indonesia)
    'LV:en',  # english (latvia)
]

Definition at line 273 of file google_news.py.

◆ about

dict searx.engines.google_news.about
Initial value:
= {
    "website": 'https://news.google.com',
    "wikidata_id": 'Q12020',
    "official_api_documentation": 'https://developers.google.com/custom-search',
    "use_official_api": False,
    "require_api_key": False,
    "results": 'HTML',
}

Definition at line 57 of file google_news.py.

◆ categories

list searx.engines.google_news.categories = ['news']

Definition at line 67 of file google_news.py.

◆ ceid_list

list searx.engines.google_news.ceid_list

Definition at line 183 of file google_news.py.

◆ logger

logging.Logger searx.engines.google_news.logger

Definition at line 52 of file google_news.py.

◆ paging

bool searx.engines.google_news.paging = False

Definition at line 68 of file google_news.py.

◆ safesearch

bool searx.engines.google_news.safesearch = True

Definition at line 75 of file google_news.py.

◆ time_range_support

bool searx.engines.google_news.time_range_support = False

Definition at line 69 of file google_news.py.