.oO SearXNG Developer Documentation Oo.
searx.utils.HTMLTextExtractor Class Reference

Public Member Functions

 __init__ (self)
None handle_starttag (self, str tag, list[tuple[str, str|None]] attrs)
None handle_endtag (self, str tag)
 is_valid_tag (self)
None handle_data (self, str data)
None handle_charref (self, str name)
None handle_entityref (self, str name)
 get_text (self)
None error (self, str message)

Public Attributes

list result = []
list tags = []

Detailed Description

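Internal class that extracts text from HTML. It is initialized as an HTMLParser, collects text fragments in result, and tracks the currently open tags in tags.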

Definition at line 86 of file utils.py.
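The following is a minimal stand-alone sketch of how the members documented below fit together, not the SearXNG source itself. It re-implements the handlers on top of html.parser.HTMLParser so it runs without SearXNG installed. The names TextExtractorSketch and extract_text are hypothetical, and the value of _BLOCKED_TAGS is an assumption, since its definition is not shown on this page.

```python
from html.parser import HTMLParser

# Assumption: the real _BLOCKED_TAGS is defined elsewhere in utils.py and is
# not shown on this page; 'script' and 'style' are used for illustration only.
_BLOCKED_TAGS = ('script', 'style')


class TextExtractorSketch(HTMLParser):
    """Stand-alone imitation of the class documented on this page."""

    def __init__(self):
        super().__init__()
        self.result: list[str] = []  # collected text fragments
        self.tags: list[str] = []    # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'br':
            self.result.append(' ')

    def handle_endtag(self, tag):
        if not self.tags:
            return
        if tag != self.tags[-1]:
            # Mismatched end tag: keep it as literal text.
            self.result.append(f"</{tag}>")
            return
        self.tags.pop()

    def is_valid_tag(self):
        return not self.tags or self.tags[-1] not in _BLOCKED_TAGS

    def handle_data(self, data):
        if self.is_valid_tag():
            self.result.append(data)

    def get_text(self):
        return ''.join(self.result).strip()


def extract_text(html: str) -> str:
    """Hypothetical helper: feed one HTML string, return the collected text."""
    parser = TextExtractorSketch()
    parser.feed(html)
    parser.close()
    return parser.get_text()


print(extract_text('<b>Hello</b> <script>x = 1</script>world'))
```

In this sketch the script body is dropped because the innermost open tag is in the assumed block list, while text inside and after ordinary tags is kept.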

Constructor & Destructor Documentation

◆ __init__()

searx.utils.HTMLTextExtractor.__init__ ( self)

Definition at line 89 of file utils.py.

def __init__(self):
    HTMLParser.__init__(self)
    self.result: list[str] = []
    self.tags: list[str] = []

Member Function Documentation

◆ error()

None searx.utils.HTMLTextExtractor.error ( self,
str message )

Definition at line 136 of file utils.py.

def error(self, message: str) -> None:
    # error handle is needed in <py3.10
    # https://github.com/python/cpython/pull/8562/files
    raise AssertionError(message)

◆ get_text()

searx.utils.HTMLTextExtractor.get_text ( self)

Definition at line 133 of file utils.py.

def get_text(self):
    return ''.join(self.result).strip()

References result.

◆ handle_charref()

None searx.utils.HTMLTextExtractor.handle_charref ( self,
str name )

Definition at line 117 of file utils.py.

def handle_charref(self, name: str) -> None:
    if not self.is_valid_tag():
        return
    if name[0] in ('x', 'X'):
        codepoint = int(name[1:], 16)
    else:
        codepoint = int(name)
    self.result.append(chr(codepoint))

References is_valid_tag(), and result.
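The branch on the first character of name distinguishes hexadecimal references (name is "x41" for &#x41;) from decimal ones (name is "65" for &#65;). A small sketch of just that decoding step, written as a free function; decode_charref is a hypothetical name used for illustration and is not part of searx.utils:

```python
def decode_charref(name: str) -> str:
    """Hypothetical helper mirroring the decoding branch of handle_charref()."""
    if name[0] in ('x', 'X'):
        # Hexadecimal reference: skip the leading 'x' and parse base 16.
        codepoint = int(name[1:], 16)
    else:
        # Decimal reference.
        codepoint = int(name)
    return chr(codepoint)


print(decode_charref('65'))     # decimal form of 'A'
print(decode_charref('x41'))    # hexadecimal form of 'A'
print(decode_charref('X20AC'))  # upper-case 'X' prefix, euro sign
```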

◆ handle_data()

None searx.utils.HTMLTextExtractor.handle_data ( self,
str data )

Definition at line 112 of file utils.py.

def handle_data(self, data: str) -> None:
    if not self.is_valid_tag():
        return
    self.result.append(data)

References is_valid_tag(), and result.

◆ handle_endtag()

None searx.utils.HTMLTextExtractor.handle_endtag ( self,
str tag )

Definition at line 99 of file utils.py.

def handle_endtag(self, tag: str) -> None:
    if not self.tags:
        return

    if tag != self.tags[-1]:
        self.result.append(f"</{tag}>")
        return

    self.tags.pop()

References result, and tags.
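The three outcomes of the listing above (empty stack, mismatch, match) can be traced with a free-function version that works on plain lists. This is only a sketch for illustration; end_tag is a hypothetical name, and its two list arguments stand in for self.tags and self.result.

```python
def end_tag(tags: list[str], result: list[str], tag: str) -> None:
    """Hypothetical free-function version of handle_endtag()."""
    if not tags:
        return  # nothing is open: the end tag is ignored
    if tag != tags[-1]:
        result.append(f"</{tag}>")  # mismatch: keep the tag as literal text
        return
    tags.pop()  # match: close the innermost open tag


tags: list[str] = ['a', 'b']
result: list[str] = []
end_tag(tags, result, 'b')  # matches the innermost tag, so 'b' is popped
end_tag(tags, result, 'i')  # mismatch, so '</i>' is appended to result
end_tag(tags, result, 'a')  # matches again, so 'a' is popped
end_tag(tags, result, 'a')  # stack is now empty, so the call is a no-op
print(tags, result)
```

Note that a mismatched end tag leaves the stack unchanged, so the tags that were open before it remain open afterwards.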

◆ handle_entityref()

None searx.utils.HTMLTextExtractor.handle_entityref ( self,
str name )

Definition at line 126 of file utils.py.

def handle_entityref(self, name: str) -> None:
    if not self.is_valid_tag():
        return
    # codepoint = htmlentitydefs.name2codepoint[name]
    # self.result.append(chr(codepoint))
    self.result.append(name)

References is_valid_tag(), and result.
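As listed, the handler appends the entity name itself (for example "amp") rather than the character it stands for; the commented-out lines point at the name2codepoint lookup that would produce the character. Whether the handler is reached at all depends on the parser's convert_charrefs setting: the Python standard library documents that handle_charref() and handle_entityref() are never called when convert_charrefs is True, which is the default used when HTMLParser.__init__(self) is called without arguments. The sketch below probes both settings; RefRecorder and probe are hypothetical names, not part of searx.utils.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Hypothetical probe that records data and entity-reference callbacks."""

    def __init__(self, convert_charrefs: bool):
        super().__init__(convert_charrefs=convert_charrefs)
        self.data: list[str] = []
        self.entityrefs: list[str] = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_entityref(self, name):
        self.entityrefs.append(name)


def probe(html: str, convert_charrefs: bool) -> tuple[str, list[str]]:
    """Return the concatenated data and the entity names seen by the parser."""
    parser = RefRecorder(convert_charrefs)
    parser.feed(html)
    parser.close()
    return ''.join(parser.data), parser.entityrefs


# Default behaviour: the reference is decoded and delivered as data.
print(probe('a &amp; b', convert_charrefs=True))
# Explicit opt-out: the reference is reported through handle_entityref().
print(probe('a &amp; b', convert_charrefs=False))
# The lookup named in the commented-out lines of the listing.
print(chr(name2codepoint['amp']))
```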

◆ handle_starttag()

None searx.utils.HTMLTextExtractor.handle_starttag ( self,
str tag,
list[tuple[str, str | None]] attrs )

Definition at line 94 of file utils.py.

def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
    self.tags.append(tag)
    if tag == 'br':
        self.result.append(' ')

References result, and tags.
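Because every start tag is pushed onto tags, it matters whether the parser later reports a matching end tag. The sketch below records the callbacks that html.parser.HTMLParser issues for the two common spellings of a line break; TagEvents and tag_events are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class TagEvents(HTMLParser):
    """Hypothetical probe: records start/end tag callbacks in order."""

    def __init__(self):
        super().__init__()
        self.events: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


def tag_events(html: str) -> list[tuple[str, str]]:
    probe = TagEvents()
    probe.feed(html)
    probe.close()
    return probe.events


print(tag_events('one<br>two'))   # only a start callback
print(tag_events('one<br/>two'))  # start callback followed by end callback
```

With the handlers listed on this page, this suggests that a bare `<br>` stays on the tags stack, whereas the self-closing `<br/>` is pushed and then popped again; in both cases a single space is appended to result.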

◆ is_valid_tag()

searx.utils.HTMLTextExtractor.is_valid_tag ( self)

Definition at line 109 of file utils.py.

def is_valid_tag(self):
    return not self.tags or self.tags[-1] not in _BLOCKED_TAGS

References tags.

Referenced by handle_charref(), handle_data(), and handle_entityref().
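The check looks only at the innermost open tag, and an empty stack always counts as valid. A free-function sketch of the same expression follows; the contents of _BLOCKED_TAGS are an assumption here, since its definition does not appear on this page.

```python
# Assumption: illustrative block list; the real value is defined in utils.py.
_BLOCKED_TAGS = ('script', 'style')


def is_valid_tag(tags: list[str]) -> bool:
    """Free-function version of the check, taking the tag stack as argument."""
    return not tags or tags[-1] not in _BLOCKED_TAGS


print(is_valid_tag([]))               # no open tag
print(is_valid_tag(['p']))            # innermost tag is not blocked
print(is_valid_tag(['p', 'script']))  # innermost tag is blocked
print(is_valid_tag(['script', 'b']))  # only the innermost tag is inspected
```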

Member Data Documentation

◆ result

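list searx.utils.HTMLTextExtractor.result = []

Definition at line 91 of file utils.py.

Referenced by get_text(), handle_charref(), handle_data(), handle_endtag(), handle_entityref(), and handle_starttag().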

◆ tags

list searx.utils.HTMLTextExtractor.tags = []

Definition at line 92 of file utils.py.

Referenced by handle_endtag(), handle_starttag(), and is_valid_tag().


The documentation for this class was generated from the following file:
  • /home/andrew/Documents/code/public/searxng/searx/utils.py