.oO SearXNG Developer Documentation Oo.
searx.utils._HTMLTextExtractor Class Reference

Public Member Functions

 __init__ (self)
 
 handle_starttag (self, tag, attrs)
 
 handle_endtag (self, tag)
 
 is_valid_tag (self)
 
 handle_data (self, data)
 
 handle_charref (self, name)
 
 handle_entityref (self, name)
 
 get_text (self)
 
 error (self, message)
 

Public Attributes

 result
 
 tags
 

Detailed Description

Internal HTMLParser subclass that extracts plain text from HTML.

Definition at line 91 of file utils.py.
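
The members documented below work together as a small state machine: start tags are pushed onto a stack, text is collected unless the innermost open tag is blocked, and get_text() joins the fragments. The following is a minimal stand-in sketch of that flow, not the SearXNG source. The names TextExtractorSketch and BLOCKED_TAGS are hypothetical, the contents of the real _BLOCKED_TAGS constant are not shown on this page, and the end-tag handling is simplified to ignore mismatches rather than raise.

```python
from html.parser import HTMLParser

# Assumption: placeholder for the real _BLOCKED_TAGS constant, whose value
# is defined elsewhere in utils.py and is not shown on this page.
BLOCKED_TAGS = ('script', 'style')


class TextExtractorSketch(HTMLParser):
    """Hypothetical stand-in mirroring the handlers listed on this page."""

    def __init__(self):
        super().__init__()
        self.result = []  # collected text fragments
        self.tags = []    # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'br':
            self.result.append(' ')  # a line break becomes a single space

    def handle_endtag(self, tag):
        # Simplified: the documented class raises on a mismatch instead.
        if self.tags and tag == self.tags[-1]:
            self.tags.pop()

    def handle_data(self, data):
        # Keep text only when the innermost open tag is not blocked.
        if not self.tags or self.tags[-1] not in BLOCKED_TAGS:
            self.result.append(data)

    def get_text(self):
        return ''.join(self.result).strip()


extractor = TextExtractorSketch()
extractor.feed('<p>Hello<br/>world</p><script>ignored()</script>')
extractor.close()
text = extractor.get_text()
print(text)
```

The parser is driven with feed() and close(), both standard HTMLParser methods; the text inside the script element is dropped because script is in the assumed blocked list.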

Constructor & Destructor Documentation

◆ __init__()

searx.utils._HTMLTextExtractor.__init__(self)

Definition at line 94 of file utils.py.

    def __init__(self):
        HTMLParser.__init__(self)
        self.result = []
        self.tags = []

Member Function Documentation

◆ error()

searx.utils._HTMLTextExtractor.error(self, message)

Definition at line 140 of file utils.py.

    def error(self, message):
        # error handle is needed in <py3.10
        # https://github.com/python/cpython/pull/8562/files
        raise AssertionError(message)

◆ get_text()

searx.utils._HTMLTextExtractor.get_text(self)

Definition at line 137 of file utils.py.

    def get_text(self):
        return ''.join(self.result).strip()

References searx.utils._HTMLTextExtractor.result.

◆ handle_charref()

searx.utils._HTMLTextExtractor.handle_charref(self, name)

Definition at line 121 of file utils.py.

    def handle_charref(self, name):
        if not self.is_valid_tag():
            return
        if name[0] in ('x', 'X'):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.result.append(chr(codepoint))

References searx.utils._HTMLTextExtractor.is_valid_tag(), and searx.utils._HTMLTextExtractor.result.
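
The decoding step in handle_charref() can be tried in isolation. HTMLParser passes the text between `&#` and `;` as name, so a hexadecimal reference arrives with a leading x or X. The sketch below uses a hypothetical helper, decode_charref, that is not part of searx.utils. Note also that, as far as the Python documentation describes it, HTMLParser only calls handle_charref() when it is constructed with convert_charrefs=False; with the default, numeric references are already converted before they reach handle_data().

```python
def decode_charref(name):
    """Hypothetical helper: decode the body of a numeric character reference."""
    if name[0] in ('x', 'X'):
        codepoint = int(name[1:], 16)  # hexadecimal form, e.g. &#x20AC;
    else:
        codepoint = int(name)          # decimal form, e.g. &#8364;
    return chr(codepoint)


euro_from_hex = decode_charref('x20AC')
euro_from_dec = decode_charref('8364')
print(euro_from_hex == euro_from_dec)
```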

◆ handle_data()

searx.utils._HTMLTextExtractor.handle_data(self, data)

Definition at line 116 of file utils.py.

    def handle_data(self, data):
        if not self.is_valid_tag():
            return
        self.result.append(data)

References searx.utils._HTMLTextExtractor.is_valid_tag(), and searx.utils._HTMLTextExtractor.result.

◆ handle_endtag()

searx.utils._HTMLTextExtractor.handle_endtag(self, tag)

Definition at line 104 of file utils.py.

    def handle_endtag(self, tag):
        if not self.tags:
            return

        if tag != self.tags[-1]:
            raise _HTMLTextExtractorException()

        self.tags.pop()

References searx.utils._HTMLTextExtractor.tags.
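
The stack check in handle_endtag() has three outcomes: an end tag with nothing open is ignored, an end tag matching the innermost open tag pops it, and any other end tag raises. The sketch below reproduces that logic in a stand-alone parser. TagMismatch, StackChecker, and closes_in_order are hypothetical names; TagMismatch only stands in for _HTMLTextExtractorException, whose definition is not shown on this page.

```python
from html.parser import HTMLParser


class TagMismatch(Exception):
    """Hypothetical stand-in for _HTMLTextExtractorException."""


class StackChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if not self.tags:
            return                 # stray end tag, nothing open: ignored
        if tag != self.tags[-1]:
            raise TagMismatch()    # end tag does not close the innermost tag
        self.tags.pop()


def closes_in_order(html):
    """Return False if any end tag fails the stack check, True otherwise."""
    checker = StackChecker()
    try:
        checker.feed(html)
        checker.close()
    except TagMismatch:
        return False
    return True


nested = closes_in_order('<div><b>ok</b></div>')
crossed = closes_in_order('<div><b>bad</div></b>')
stray = closes_in_order('</p>text')
print(nested, crossed, stray)
```

Because only end tags are checked, input that leaves tags open at the end still passes; the check detects crossed or out-of-order closing, not unclosed elements.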

◆ handle_entityref()

searx.utils._HTMLTextExtractor.handle_entityref(self, name)

Definition at line 130 of file utils.py.

    def handle_entityref(self, name):
        if not self.is_valid_tag():
            return
        # codepoint = htmlentitydefs.name2codepoint[name]
        # self.result.append(chr(codepoint))
        self.result.append(name)

References searx.utils._HTMLTextExtractor.is_valid_tag(), and searx.utils._HTMLTextExtractor.result.
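
As listed, handle_entityref() appends the entity name itself (for example amp), while the commented-out lines hint at resolving it to a character. A minimal sketch of that resolution follows, assuming the Python 3 module html.entities (the successor of the Python 2 htmlentitydefs module named in the comment). The helper resolve_entity and its fallback to the raw name are illustrative choices, not SearXNG behavior.

```python
from html.entities import name2codepoint


def resolve_entity(name):
    """Hypothetical helper: map a named entity to its character."""
    codepoint = name2codepoint.get(name)
    if codepoint is None:
        # Unknown entity: fall back to the raw name, which is what the
        # listed handler appends for every entity.
        return name
    return chr(codepoint)


ampersand = resolve_entity('amp')
unknown = resolve_entity('notanentity')
print(ampersand, unknown)
```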

◆ handle_starttag()

searx.utils._HTMLTextExtractor.handle_starttag(self, tag, attrs)

Definition at line 99 of file utils.py.

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'br':
            self.result.append(' ')

References searx.utils._HTMLTextExtractor.result, and searx.utils._HTMLTextExtractor.tags.
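
A small sketch can make the two effects of handle_starttag() visible: every start tag is pushed onto the stack, and br additionally contributes a space so that adjacent words do not run together. StartTagSketch is a hypothetical class; its handle_data() collects all text without the blocked-tag check, to keep the example short.

```python
from html.parser import HTMLParser


class StartTagSketch(HTMLParser):
    """Hypothetical parser reproducing only the start-tag logic."""

    def __init__(self):
        super().__init__()
        self.result = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'br':
            self.result.append(' ')

    def handle_data(self, data):
        self.result.append(data)  # simplified: no blocked-tag check


parser = StartTagSketch()
parser.feed('one<br>two')
parser.close()
joined = ''.join(parser.result)
open_tags = list(parser.tags)
print(repr(joined), open_tags)
```

In this sketch a bare `<br>` stays on the stack because no end tag ever pops it. With the handle_endtag() shown on this page, a later end tag of a different element would then take the mismatch branch; the self-closing form `<br/>` avoids this, since HTMLParser's default handle_startendtag() calls both handlers.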

◆ is_valid_tag()

searx.utils._HTMLTextExtractor.is_valid_tag(self)

Definition at line 113 of file utils.py.

    def is_valid_tag(self):
        return not self.tags or self.tags[-1] not in _BLOCKED_TAGS

References searx.utils._HTMLTextExtractor.tags.

Referenced by searx.utils._HTMLTextExtractor.handle_charref(), searx.utils._HTMLTextExtractor.handle_data(), and searx.utils._HTMLTextExtractor.handle_entityref().
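
The predicate only inspects the innermost open tag, which the following sketch illustrates as a free function over a tag stack. BLOCKED_TAGS is an assumed placeholder, since the value of _BLOCKED_TAGS is not shown on this page, and is_valid_tag here takes the stack as an argument purely for illustration.

```python
# Assumption: placeholder for the real _BLOCKED_TAGS constant.
BLOCKED_TAGS = ('script', 'style')


def is_valid_tag(tags):
    """Text is kept at top level or when the innermost tag is not blocked."""
    return not tags or tags[-1] not in BLOCKED_TAGS


top_level = is_valid_tag([])
in_paragraph = is_valid_tag(['div', 'p'])
in_script = is_valid_tag(['div', 'script'])
inside_blocked_ancestor = is_valid_tag(['script', 'b'])
print(top_level, in_paragraph, in_script, inside_blocked_ancestor)
```

The last case returns True: a blocked ancestor further down the stack does not suppress text, because only tags[-1] is consulted.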

Member Data Documentation

◆ result

◆ tags


The documentation for this class was generated from the following file: