.oO SearXNG Developer Documentation Oo.
searx.utils._HTMLTextExtractor Class Reference

Public Member Functions

__init__(self)
handle_starttag(self, tag, attrs)
handle_endtag(self, tag)
is_valid_tag(self)
handle_data(self, data)
handle_charref(self, name)
handle_entityref(self, name)
get_text(self)
error(self, message)

Public Attributes

list result = []
list tags = []

Detailed Description

Internal class to extract text from HTML.

Definition at line 81 of file utils.py.
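
A minimal, self-contained sketch of how a parser of this shape behaves end to end. It reproduces the handlers documented on this page in a stand-in class rather than importing `searx.utils`. The names `TextExtractorSketch`, `SketchException`, and `extract_text`, and the contents of `_BLOCKED_TAGS`, are assumptions made for illustration; the real `_BLOCKED_TAGS` and `_HTMLTextExtractorException` are defined elsewhere in utils.py and are not shown here.

```python
from html.parser import HTMLParser

# Assumption: illustrative values only; the real tuple lives elsewhere in utils.py.
_BLOCKED_TAGS = ('script', 'style')


class SketchException(Exception):
    """Hypothetical stand-in for _HTMLTextExtractorException."""


class TextExtractorSketch(HTMLParser):
    """Stand-in that mirrors the handlers documented on this page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.result = []  # collected text fragments
        self.tags = []    # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'br':
            # a line break becomes a single space in the output
            self.result.append(' ')

    def handle_endtag(self, tag):
        if not self.tags:
            return
        if tag != self.tags[-1]:
            raise SketchException()
        self.tags.pop()

    def is_valid_tag(self):
        return not self.tags or self.tags[-1] not in _BLOCKED_TAGS

    def handle_data(self, data):
        if not self.is_valid_tag():
            return
        self.result.append(data)

    def get_text(self):
        return ''.join(self.result).strip()


def extract_text(html):
    """Hypothetical helper: feed one HTML string and return the text."""
    parser = TextExtractorSketch()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

Text inside a blocked tag is skipped, and a self-closing `<br/>` contributes one space. The character and entity reference handlers are left out of this sketch to keep it short.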

Constructor & Destructor Documentation

◆ __init__()

searx.utils._HTMLTextExtractor.__init__(self)

Definition at line 84 of file utils.py.

    def __init__(self):
        HTMLParser.__init__(self)
        self.result = []
        self.tags = []

Member Function Documentation

◆ error()

searx.utils._HTMLTextExtractor.error(self, message)

Definition at line 130 of file utils.py.

    def error(self, message):
        # error handle is needed in <py3.10
        # https://github.com/python/cpython/pull/8562/files
        raise AssertionError(message)

◆ get_text()

searx.utils._HTMLTextExtractor.get_text(self)

Definition at line 127 of file utils.py.

    def get_text(self):
        return ''.join(self.result).strip()

References result.

◆ handle_charref()

searx.utils._HTMLTextExtractor.handle_charref(self, name)

Definition at line 111 of file utils.py.

    def handle_charref(self, name):
        if not self.is_valid_tag():
            return
        if name[0] in ('x', 'X'):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.result.append(chr(codepoint))

References is_valid_tag(), and result.
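
The decoding step in handle_charref() can be sketched as a standalone function. `charref_to_char` is a hypothetical name used only for this illustration; it applies the same rule as the listing: a leading `x` or `X` marks a hexadecimal code point, anything else is read as decimal.

```python
def charref_to_char(name):
    """Decode the body of a numeric character reference.

    `name` is the text between '&#' and ';', for example 'x41' or '65'.
    """
    if name[0] in ('x', 'X'):
        # hexadecimal form, e.g. &#x41;
        codepoint = int(name[1:], 16)
    else:
        # decimal form, e.g. &#65;
        codepoint = int(name)
    return chr(codepoint)
```

Note that, according to the Python documentation for `html.parser.HTMLParser`, this handler is not called when `convert_charrefs` is true, which is the default. Since `__init__()` calls `HTMLParser.__init__(self)` without arguments, the parser itself most likely converts such references before handle_data() receives them.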


◆ handle_data()

searx.utils._HTMLTextExtractor.handle_data(self, data)

Definition at line 106 of file utils.py.

    def handle_data(self, data):
        if not self.is_valid_tag():
            return
        self.result.append(data)

References is_valid_tag(), and result.


◆ handle_endtag()

searx.utils._HTMLTextExtractor.handle_endtag(self, tag)

Definition at line 94 of file utils.py.

    def handle_endtag(self, tag):
        if not self.tags:
            return

        if tag != self.tags[-1]:
            raise _HTMLTextExtractorException()

        self.tags.pop()

References tags.
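
The tag-stack check can be exercised in isolation. The sketch below keeps only the start and end tag handlers; `TagStackSketch`, `TagMismatch`, and `parses_without_mismatch` are hypothetical names, with `TagMismatch` standing in for `_HTMLTextExtractorException`. It shows that an unclosed void element such as `<br>` stays on the stack, so the next end tag no longer matches the top entry.

```python
from html.parser import HTMLParser


class TagMismatch(Exception):
    """Hypothetical stand-in for _HTMLTextExtractorException."""


class TagStackSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if not self.tags:
            # a stray end tag with nothing open is ignored
            return
        if tag != self.tags[-1]:
            raise TagMismatch(tag)
        self.tags.pop()


def parses_without_mismatch(html):
    """Return False if an end tag did not match the innermost open tag."""
    parser = TagStackSketch()
    try:
        parser.feed(html)
    except TagMismatch:
        return False
    return True
```

A self-closing `<br/>` passes because `HTMLParser` reports it as a start tag immediately followed by an end tag. Tags that are opened and never closed do not raise; only a mismatching end tag does.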

◆ handle_entityref()

searx.utils._HTMLTextExtractor.handle_entityref(self, name)

Definition at line 120 of file utils.py.

    def handle_entityref(self, name):
        if not self.is_valid_tag():
            return
        # codepoint = htmlentitydefs.name2codepoint[name]
        # self.result.append(chr(codepoint))
        self.result.append(name)

References is_valid_tag(), and result.
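
As listed, handle_entityref() appends the entity name itself, while the commented-out lines would have appended the resolved character. The sketch below contrasts the two; both function names are hypothetical, and `html.entities.name2codepoint` is used as the Python 3 equivalent of the `htmlentitydefs` lookup mentioned in the comment.

```python
from html.entities import name2codepoint


def entity_as_listed(name):
    """Mirror the active line of the listing: the name is passed through."""
    return name


def entity_resolved(name):
    """Mirror the commented-out lines: look up the code point by name.

    Raises KeyError for names that are not in name2codepoint.
    """
    return chr(name2codepoint[name])
```

For `&amp;`, the first variant yields the string `amp` and the second yields `&`. With the default `convert_charrefs` setting of `HTMLParser`, the Python documentation states that this handler is not called at all, so the difference is unlikely to show up in normal parsing.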


◆ handle_starttag()

searx.utils._HTMLTextExtractor.handle_starttag(self, tag, attrs)

Definition at line 89 of file utils.py.

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'br':
            self.result.append(' ')

References result, and tags.

◆ is_valid_tag()

searx.utils._HTMLTextExtractor.is_valid_tag(self)

Definition at line 103 of file utils.py.

    def is_valid_tag(self):
        return not self.tags or self.tags[-1] not in _BLOCKED_TAGS

References tags.

Referenced by handle_charref(), handle_data(), and handle_entityref().
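
The predicate only inspects the innermost open tag. A small sketch, assuming illustrative contents for `_BLOCKED_TAGS` (the real tuple is defined elsewhere in utils.py) and using the hypothetical function name `is_valid_for`:

```python
# Assumption: illustrative values only; not taken from this page.
_BLOCKED_TAGS = ('script', 'style')


def is_valid_for(tags):
    """Same expression as is_valid_tag(), applied to an explicit tag stack."""
    return not tags or tags[-1] not in _BLOCKED_TAGS
```

An empty stack is valid, `['p', 'script']` is not, and `['script', 'b']` is valid again because only the last entry is checked.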


Member Data Documentation

◆ result

list searx.utils._HTMLTextExtractor.result = []

Definition at line 86 of file utils.py.

Referenced by get_text(), handle_charref(), handle_data(), handle_entityref(), and handle_starttag().

◆ tags

list searx.utils._HTMLTextExtractor.tags = []

Definition at line 87 of file utils.py.

Referenced by handle_endtag(), handle_starttag(), and is_valid_tag().


The documentation for this class was generated from the following file:
  • /home/andrew/Documents/code/public/searxng/searx/utils.py