PYTHON CODE
Question: Can you explain the data you used and what we are achieving, along with comments / analysis?
Answer: The data is Social_Network_Ads.csv (reproduced in full below). Each row describes one user of a social network: User ID, Gender, Age, EstimatedSalary, and Purchased, where Purchased is 1 if the user bought the advertised product and 0 otherwise. We are building a decision tree classifier that predicts Purchased from Age and EstimatedSalary alone: the 400 users are split 75%/25% into training and test sets, the tree is fitted on the training set, evaluated on the test set with a confusion matrix, and the decision regions it learns are visualised. The commented code follows.
# Decision Tree Classification

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset. Columns 2 and 3 (Age, EstimatedSalary) are the
# features; column 4 (Purchased) is the target.
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set (300 users) and Test set (100 users).
# Note: sklearn.cross_validation was removed from scikit-learn; the module
# is now sklearn.model_selection.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature Scaling. A decision tree is insensitive to feature scale, but
# standardising keeps Age and EstimatedSalary on comparable axes in the
# decision-region plots below.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Decision Tree Classification to the Training set, using entropy
# (information gain) to choose each split
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
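Once fitted, the same pipeline classifies any new user; a minimal sketch (the age and salary values here are invented for illustration):

# Hypothetical new user: 30 years old, earning 87,000. New data must pass
# through the same StandardScaler that was fitted on the training set.
new_user = sc.transform([[30, 87000]])
print(classifier.predict(new_user))  # array([0]) means "would not purchase"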
# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix. Rows are the true classes (0 = did not
# purchase, 1 = purchased) and columns are the predicted classes, so the
# diagonal counts correct predictions.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
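The matrix condenses to a single accuracy figure; a small sketch using standard scikit-learn metrics:

from sklearn.metrics import accuracy_score

print(cm)                              # [[TN, FP], [FN, TP]]
print(accuracy_score(y_test, y_pred))  # fraction of the 100 test users classified correctly
# Equivalently: (cm[0, 0] + cm[1, 1]) / cm.sum()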
# Visualising the Training set results: colour every point of the (scaled)
# Age/Salary plane by the class the tree predicts there (red = 0, green = 1),
# then scatter the actual training points on top.
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results against the same decision regions
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
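Analysis: because the tree is grown with no depth limit, it carves the Age/EstimatedSalary plane into many small rectangles and fits the training set almost perfectly; the test-set plot typically shows more points landing on the wrong colour, the usual sign of mild overfitting (capping max_depth would smooth the regions). To inspect the learned splits directly, newer scikit-learn versions (0.21+) can print the tree as text; a minimal sketch:

from sklearn.tree import export_text

# The thresholds appear in standardised units because the tree was
# trained on scaled features.
print(export_text(classifier, feature_names=['Age', 'EstimatedSalary']))

The full Social_Network_Ads.csv used above follows.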
User ID,Gender,Age,EstimatedSalary,Purchased
15624510,Male,19,19000,0
15810944,Male,35,20000,0
15668575,Female,26,43000,0
15603246,Female,27,57000,0
15804002,Male,19,76000,0
15728773,Male,27,58000,0
15598044,Female,27,84000,0
15694829,Female,32,150000,1
15600575,Male,25,33000,0
15727311,Female,35,65000,0
15570769,Female,26,80000,0
15606274,Female,26,52000,0
15746139,Male,20,86000,0
15704987,Male,32,18000,0
15628972,Male,18,82000,0
15697686,Male,29,80000,0
15733883,Male,47,25000,1
15617482,Male,45,26000,1
15704583,Male,46,28000,1
15621083,Female,48,29000,1
15649487,Male,45,22000,1
15736760,Female,47,49000,1
15714658,Male,48,41000,1
15599081,Female,45,22000,1
15705113,Male,46,23000,1
15631159,Male,47,20000,1
15792818,Male,49,28000,1
15633531,Female,47,30000,1
15744529,Male,29,43000,0
15669656,Male,31,18000,0
15581198,Male,31,74000,0
15729054,Female,27,137000,1
15573452,Female,21,16000,0
15776733,Female,28,44000,0
15724858,Male,27,90000,0
15713144,Male,35,27000,0
15690188,Female,33,28000,0
15689425,Male,30,49000,0
15671766,Female,26,72000,0
15782806,Female,27,31000,0
15764419,Female,27,17000,0
15591915,Female,33,51000,0
15772798,Male,35,108000,0
15792008,Male,30,15000,0
15715541,Female,28,84000,0
15639277,Male,23,20000,0
15798850,Male,25,79000,0
15776348,Female,27,54000,0
15727696,Male,30,135000,1
15793813,Female,31,89000,0
15694395,Female,24,32000,0
15764195,Female,18,44000,0
15744919,Female,29,83000,0
15671655,Female,35,23000,0
15654901,Female,27,58000,0
15649136,Female,24,55000,0
15775562,Female,23,48000,0
15807481,Male,28,79000,0
15642885,Male,22,18000,0
15789109,Female,32,117000,0
15814004,Male,27,20000,0
15673619,Male,25,87000,0
15595135,Female,23,66000,0
15583681,Male,32,120000,1
15605000,Female,59,83000,0
15718071,Male,24,58000,0
15679760,Male,24,19000,0
15654574,Female,23,82000,0
15577178,Female,22,63000,0
15595324,Female,31,68000,0
15756932,Male,25,80000,0
15726358,Female,24,27000,0
15595228,Female,20,23000,0
15782530,Female,33,113000,0
15592877,Male,32,18000,0
15651983,Male,34,112000,1
15746737,Male,18,52000,0
15774179,Female,22,27000,0
15667265,Female,28,87000,0
15655123,Female,26,17000,0
15595917,Male,30,80000,0
15668385,Male,39,42000,0
15709476,Male,20,49000,0
15711218,Male,35,88000,0
15798659,Female,30,62000,0
15663939,Female,31,118000,1
15694946,Male,24,55000,0
15631912,Female,28,85000,0
15768816,Male,26,81000,0
15682268,Male,35,50000,0
15684801,Male,22,81000,0
15636428,Female,30,116000,0
15809823,Male,26,15000,0
15699284,Female,29,28000,0
15786993,Female,29,83000,0
15709441,Female,35,44000,0
15710257,Female,35,25000,0
15582492,Male,28,123000,1
15575694,Male,35,73000,0
15756820,Female,28,37000,0
15766289,Male,27,88000,0
15593014,Male,28,59000,0
15584545,Female,32,86000,0
15675949,Female,33,149000,1
15672091,Female,19,21000,0
15801658,Male,21,72000,0
15706185,Female,26,35000,0
15789863,Male,27,89000,0
15720943,Male,26,86000,0
15697997,Female,38,80000,0
15665416,Female,39,71000,0
15660200,Female,37,71000,0
15619653,Male,38,61000,0
15773447,Male,37,55000,0
15739160,Male,42,80000,0
15689237,Male,40,57000,0
15679297,Male,35,75000,0
15591433,Male,36,52000,0
15642725,Male,40,59000,0
15701962,Male,41,59000,0
15811613,Female,36,75000,0
15741049,Male,37,72000,0
15724423,Female,40,75000,0
15574305,Male,35,53000,0
15678168,Female,41,51000,0
15697020,Female,39,61000,0
15610801,Male,42,65000,0
15745232,Male,26,32000,0
15722758,Male,30,17000,0
15792102,Female,26,84000,0
15675185,Male,31,58000,0
15801247,Male,33,31000,0
15725660,Male,30,87000,0
15638963,Female,21,68000,0
15800061,Female,28,55000,0
15578006,Male,23,63000,0
15668504,Female,20,82000,0
15687491,Male,30,107000,1
15610403,Female,28,59000,0
15741094,Male,19,25000,0
15807909,Male,19,85000,0
15666141,Female,18,68000,0
15617134,Male,35,59000,0
15783029,Male,30,89000,0
15622833,Female,34,25000,0
15746422,Female,24,89000,0
15750839,Female,27,96000,1
15749130,Female,41,30000,0
15779862,Male,29,61000,0
15767871,Male,20,74000,0
15679651,Female,26,15000,0
15576219,Male,41,45000,0
15699247,Male,31,76000,0
15619087,Female,36,50000,0
15605327,Male,40,47000,0
15610140,Female,31,15000,0
15791174,Male,46,59000,0
15602373,Male,29,75000,0
15762605,Male,26,30000,0
15598840,Female,32,135000,1
15744279,Male,32,100000,1
15670619,Male,25,90000,0
15599533,Female,37,33000,0
15757837,Male,35,38000,0
15697574,Female,33,69000,0
15578738,Female,18,86000,0
15762228,Female,22,55000,0
15614827,Female,35,71000,0
15789815,Male,29,148000,1
15579781,Female,29,47000,0
15587013,Male,21,88000,0
15570932,Male,34,115000,0
15794661,Female,26,118000,0
15581654,Female,34,43000,0
15644296,Female,34,72000,0
15614420,Female,23,28000,0
15609653,Female,35,47000,0
15594577,Male,25,22000,0
15584114,Male,24,23000,0
15673367,Female,31,34000,0
15685576,Male,26,16000,0
15774727,Female,31,71000,0
15694288,Female,32,117000,1
15603319,Male,33,43000,0
15759066,Female,33,60000,0
15814816,Male,31,66000,0
15724402,Female,20,82000,0
15571059,Female,33,41000,0
15674206,Male,35,72000,0
15715160,Male,28,32000,0
15730448,Male,24,84000,0
15662067,Female,19,26000,0
15779581,Male,29,43000,0
15662901,Male,19,70000,0
15689751,Male,28,89000,0
15667742,Male,34,43000,0
15738448,Female,30,79000,0
15680243,Female,20,36000,0
15745083,Male,26,80000,0
15708228,Male,35,22000,0
15628523,Male,35,39000,0
15708196,Male,49,74000,0
15735549,Female,39,134000,1
15809347,Female,41,71000,0
15660866,Female,58,101000,1
15766609,Female,47,47000,0
15654230,Female,55,130000,1
15794566,Female,52,114000,0
15800890,Female,40,142000,1
15697424,Female,46,22000,0
15724536,Female,48,96000,1
15735878,Male,52,150000,1
15707596,Female,59,42000,0
15657163,Male,35,58000,0
15622478,Male,47,43000,0
15779529,Female,60,108000,1
15636023,Male,49,65000,0
15582066,Male,40,78000,0
15666675,Female,46,96000,0
15732987,Male,59,143000,1
15789432,Female,41,80000,0
15663161,Male,35,91000,1
15694879,Male,37,144000,1
15593715,Male,60,102000,1
15575002,Female,35,60000,0
15622171,Male,37,53000,0
15795224,Female,36,126000,1
15685346,Male,56,133000,1
15691808,Female,40,72000,0
15721007,Female,42,80000,1
15794253,Female,35,147000,1
15694453,Male,39,42000,0
15813113,Male,40,107000,1
15614187,Male,49,86000,1
15619407,Female,38,112000,0
15646227,Male,46,79000,1
15660541,Male,40,57000,0
15753874,Female,37,80000,0
15617877,Female,46,82000,0
15772073,Female,53,143000,1
15701537,Male,42,149000,1
15736228,Male,38,59000,0
15780572,Female,50,88000,1
15769596,Female,56,104000,1
15586996,Female,41,72000,0
15722061,Female,51,146000,1
15638003,Female,35,50000,0
15775590,Female,57,122000,1
15730688,Male,41,52000,0
15753102,Female,35,97000,1
15810075,Female,44,39000,0
15723373,Male,37,52000,0
15795298,Female,48,134000,1
15584320,Female,37,146000,1
15724161,Female,50,44000,0
15750056,Female,52,90000,1
15609637,Female,41,72000,0
15794493,Male,40,57000,0
15569641,Female,58,95000,1
15815236,Female,45,131000,1
15811177,Female,35,77000,0
15680587,Male,36,144000,1
15672821,Female,55,125000,1
15767681,Female,35,72000,0
15600379,Male,48,90000,1
15801336,Female,42,108000,1
15721592,Male,40,75000,0
15581282,Male,37,74000,0
15746203,Female,47,144000,1
15583137,Male,40,61000,0
15680752,Female,43,133000,0
15688172,Female,59,76000,1
15791373,Male,60,42000,1
15589449,Male,39,106000,1
15692819,Female,57,26000,1
15727467,Male,57,74000,1
15734312,Male,38,71000,0
15764604,Male,49,88000,1
15613014,Female,52,38000,1
15759684,Female,50,36000,1
15609669,Female,59,88000,1
15685536,Male,35,61000,0
15750447,Male,37,70000,1
15663249,Female,52,21000,1
15638646,Male,48,141000,0
15734161,Female,37,93000,1
15631070,Female,37,62000,0
15761950,Female,48,138000,1
15649668,Male,41,79000,0
15713912,Female,37,78000,1
15586757,Male,39,134000,1
15596522,Male,49,89000,1
15625395,Male,55,39000,1
15760570,Male,37,77000,0
15566689,Female,35,57000,0
15725794,Female,36,63000,0
15673539,Male,42,73000,1
15705298,Female,43,112000,1
15675791,Male,45,79000,0
15747043,Male,46,117000,1
15736397,Female,58,38000,1
15678201,Male,48,74000,1
15720745,Female,37,137000,1
15637593,Male,37,79000,1
15598070,Female,40,60000,0
15787550,Male,42,54000,0
15603942,Female,51,134000,0
15733973,Female,47,113000,1
15596761,Male,36,125000,1
15652400,Female,38,50000,0
15717893,Female,42,70000,0
15622585,Male,39,96000,1
15733964,Female,38,50000,0
15753861,Female,49,141000,1
15747097,Female,39,79000,0
15594762,Female,39,75000,1
15667417,Female,54,104000,1
15684861,Male,35,55000,0
15742204,Male,45,32000,1
15623502,Male,36,60000,0
15774872,Female,52,138000,1
15611191,Female,53,82000,1
15674331,Male,41,52000,0
15619465,Female,48,30000,1
15575247,Female,48,131000,1
15695679,Female,41,60000,0
15713463,Male,41,72000,0
15785170,Female,42,75000,0
15796351,Male,36,118000,1
15639576,Female,47,107000,1
15693264,Male,38,51000,0
15589715,Female,48,119000,1
15769902,Male,42,65000,0
15587177,Male,40,65000,0
15814553,Male,57,60000,1
15601550,Female,36,54000,0
15664907,Male,58,144000,1
15612465,Male,35,79000,0
15810800,Female,38,55000,0
15665760,Male,39,122000,1
15588080,Female,53,104000,1
15776844,Male,35,75000,0
15717560,Female,38,65000,0
15629739,Female,47,51000,1
15729908,Male,47,105000,1
15716781,Female,41,63000,0
15646936,Male,53,72000,1
15768151,Female,54,108000,1
15579212,Male,39,77000,0
15721835,Male,38,61000,0
15800515,Female,38,113000,1
15591279,Male,37,75000,0
15587419,Female,42,90000,1
15750335,Female,37,57000,0
15699619,Male,36,99000,1
15606472,Male,60,34000,1
15778368,Male,54,70000,1
15671387,Female,41,72000,0
15573926,Male,40,71000,1
15709183,Male,42,54000,0
15577514,Male,43,129000,1
15778830,Female,53,34000,1
15768072,Female,47,50000,1
15768293,Female,42,79000,0
15654456,Male,42,104000,1
15807525,Female,59,29000,1
15574372,Female,58,47000,1
15671249,Male,46,88000,1
15779744,Male,38,71000,0
15624755,Female,54,26000,1
15611430,Female,60,46000,1
15774744,Male,60,83000,1
15629885,Female,39,73000,0
15708791,Male,59,130000,1
15793890,Female,37,80000,0
15646091,Female,46,32000,1
15596984,Female,46,74000,0
15800215,Female,42,53000,0
15577806,Male,41,87000,1
15749381,Female,58,23000,1
15683758,Male,42,64000,0
15670615,Male,48,33000,1
15715622,Female,44,139000,1
15707634,Male,49,28000,1
15806901,Female,57,33000,1
15775335,Male,56,60000,1
15724150,Female,49,39000,1
15627220,Male,39,71000,0
15672330,Male,47,34000,1
15668521,Female,48,35000,1
15807837,Male,48,33000,1
15592570,Male,47,23000,1
15748589,Female,45,45000,1
15635893,Male,60,42000,1
15757632,Female,39,59000,0
15691863,Female,46,41000,1
15706071,Male,51,23000,1
15654296,Female,50,20000,1
15755018,Male,36,33000,0
15594041,Female,49,36000,1
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/AUTHORS.txt
Behold, mortal, the origins of Beautiful Soup…
================================================
Leonard Richardson is the primary programmer.
Aaron DeVore is awesome.
Mark Pilgrim provided the encoding detection code that forms the base
of UnicodeDammit.
Thomas Kluyver and Ezio Melotti finished the work of getting Beautiful
Soup 4 working under Python 3.
Simon Willison wrote soupselect, which was used to make Beautiful Soup
support CSS selectors.
Sam Ruby helped with a lot of edge cases.
Jonathan Ellis was awarded the prestigious Beau Potage D'Or for his
work in solving the nestable tags conundrum.
An incomplete list of people who have contributed patches to Beautiful
Soup:
Istvan Albert, Andrew Lin, Anthony Baxter, Andrew Boyko, Tony Chang,
Zephyr Fang, Fuzzy, Roman Gaufman, Yoni Gilad, Richie Hindle, Peteris
Krumins, Kent Johnson, Ben Last, Robert Leftwich, Staffan Malmgren,
Ksenia Marasanova, JP Moins, Adam Monsen, John Nagle, “Jon”, Ed
Oskiewicz, Greg Phillips, Giles Radford, Arthur Rudolph, Marko
Samastur, Jouni Seppänen, Alexander Schmolck, Andy Theyers, Glyn
Webster, Paul Wright, Danny Yoo
An incomplete list of people who made suggestions or found bugs or
found ways to break Beautiful Soup:
Hanno Böck, Matteo Bertini, Chris Curvey, Simon Cusack, Bruce Eckel,
Matt Ernst, Michael Foord, Tom Harris, Bill de hOra, Donald Howes,
Matt Patterson, Scott Roberts, Steve Strassmann, Mike Williams,
warchild at redho dot com, Sami Kuisma, Carlos Rocha, Bob Hutchison,
Joren Mc, Michal Migurski, John Kleven, Tim Heaney, Tripp Lilley, Ed
Summers, Dennis Sutch, Chris Smith, Aaron Sweep^W Swartz, Stuart
Turner, Greg Edwards, Kevin J Kalupson, Nikos Kouremenos, Artur de
Sousa Rocha, Yichun Wei, Per Vognsen
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/builder/_html5lib.py
__all__ = [
    'HTML5TreeBuilder',
    ]

import warnings
from bs4.builder import (
    PERMISSIVE,
    HTML,
    HTML_5,
    HTMLTreeBuilder,
    )
from bs4.element import NamespacedAttribute
import html5lib
from html5lib.constants import namespaces
from bs4.element import (
    Comment,
    Doctype,
    NavigableString,
    Tag,
    )

class HTML5TreeBuilder(HTMLTreeBuilder):
    """Use html5lib to build a tree."""

    features = ['html5lib', PERMISSIVE, HTML_5, HTML]

    def prepare_markup(self, markup, user_specified_encoding):
        # Store the user-specified encoding for use later on.
        self.user_specified_encoding = user_specified_encoding
        return markup, None, None, False

    # These methods are defined by Beautiful Soup.
    def feed(self, markup):
        if self.soup.parse_only is not None:
            warnings.warn("You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.")
        parser = html5lib.HTMLParser(tree=self.create_treebuilder)
        doc = parser.parse(markup, encoding=self.user_specified_encoding)

        # Set the character encoding detected by the tokenizer.
        if isinstance(markup, unicode):
            # We need to special-case this because html5lib sets
            # charEncoding to UTF-8 if it gets Unicode input.
            doc.original_encoding = None
        else:
            doc.original_encoding = parser.tokenizer.stream.charEncoding[0]

    def create_treebuilder(self, namespaceHTMLElements):
        self.underlying_builder = TreeBuilderForHtml5lib(
            self.soup, namespaceHTMLElements)
        return self.underlying_builder

    def test_fragment_to_document(self, fragment):
        """See `TreeBuilder`."""
        return u'<html><head></head><body>%s</body></html>' % fragment


class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):

    def __init__(self, soup, namespaceHTMLElements):
        self.soup = soup
        super(TreeBuilderForHtml5lib, self).__init__(namespaceHTMLElements)

    def documentClass(self):
        self.soup.reset()
        return Element(self.soup, self.soup, None)

    def insertDoctype(self, token):
        name = token["name"]
        publicId = token["publicId"]
        systemId = token["systemId"]

        doctype = Doctype.for_name_and_ids(name, publicId, systemId)
        self.soup.object_was_parsed(doctype)

    def elementClass(self, name, namespace):
        tag = self.soup.new_tag(name, namespace)
        return Element(tag, self.soup, namespace)

    def commentClass(self, data):
        return TextNode(Comment(data), self.soup)

    def fragmentClass(self):
        self.soup = BeautifulSoup("")
        self.soup.name = "[document_fragment]"
        return Element(self.soup, self.soup, None)

    def appendChild(self, node):
        # XXX This code is not covered by the BS4 tests.
        self.soup.append(node.element)

    def getDocument(self):
        return self.soup

    def getFragment(self):
        return html5lib.treebuilders._base.TreeBuilder.getFragment(self).element


class AttrList(object):
    def __init__(self, element):
        self.element = element
        self.attrs = dict(self.element.attrs)
    def __iter__(self):
        return list(self.attrs.items()).__iter__()
    def __setitem__(self, name, value):
        "set attr", name, value
        self.element[name] = value
    def items(self):
        return list(self.attrs.items())
    def keys(self):
        return list(self.attrs.keys())
    def __len__(self):
        return len(self.attrs)
    def __getitem__(self, name):
        return self.attrs[name]
    def __contains__(self, name):
        return name in list(self.attrs.keys())


class Element(html5lib.treebuilders._base.Node):
    def __init__(self, element, soup, namespace):
        html5lib.treebuilders._base.Node.__init__(self, element.name)
        self.element = element
        self.soup = soup
        self.namespace = namespace

    def appendChild(self, node):
        if (node.element.__class__ == NavigableString and self.element.contents
            and self.element.contents[-1].__class__ == NavigableString):
            # Concatenate new text onto old text node
            # XXX This has O(n^2) performance, for input like
            # "<FOO>aaa</FOO><FOO>bbb</FOO>..."
            old_element = self.element.contents[-1]
            new_element = self.soup.new_string(old_element + node.element)
            old_element.replace_with(new_element)
        else:
            self.element.append(node.element)
            node.parent = self

    def getAttributes(self):
        return AttrList(self.element)

    def setAttributes(self, attributes):
        if attributes is not None and len(attributes) > 0:

            converted_attributes = []
            for name, value in list(attributes.items()):
                if isinstance(name, tuple):
                    new_name = NamespacedAttribute(*name)
                    del attributes[name]
                    attributes[new_name] = value

            self.soup.builder._replace_cdata_list_attribute_values(
                self.name, attributes)
            for name, value in attributes.items():
                self.element[name] = value

            # The attributes may contain variables that need substitution.
            # Call set_up_substitutions manually.
            #
            # The Tag constructor called this method when the Tag was created,
            # but we just set/changed the attributes, so call it again.
            self.soup.builder.set_up_substitutions(self.element)

    attributes = property(getAttributes, setAttributes)

    def insertText(self, data, insertBefore=None):
        text = TextNode(self.soup.new_string(data), self.soup)
        if insertBefore:
            self.insertBefore(text, insertBefore)
        else:
            self.appendChild(text)

    def insertBefore(self, node, refNode):
        index = self.element.index(refNode.element)
        if (node.element.__class__ == NavigableString and self.element.contents
            and self.element.contents[index-1].__class__ == NavigableString):
            # (See comments in appendChild)
            old_node = self.element.contents[index-1]
            new_str = self.soup.new_string(old_node + node.element)
            old_node.replace_with(new_str)
        else:
            self.element.insert(index, node.element)
            node.parent = self

    def removeChild(self, node):
        node.element.extract()

    def reparentChildren(self, newParent):
        while self.element.contents:
            child = self.element.contents[0]
            child.extract()
            if isinstance(child, Tag):
                newParent.appendChild(
                    Element(child, self.soup, namespaces["html"]))
            else:
                newParent.appendChild(
                    TextNode(child, self.soup))

    def cloneNode(self):
        tag = self.soup.new_tag(self.element.name, self.namespace)
        node = Element(tag, self.soup, self.namespace)
        for key,value in self.attributes:
            node.attributes[key] = value
        return node

    def hasContent(self):
        return self.element.contents

    def getNameTuple(self):
        if self.namespace == None:
            return namespaces["html"], self.name
        else:
            return self.namespace, self.name

    nameTuple = property(getNameTuple)


class TextNode(Element):
    def __init__(self, element, soup):
        html5lib.treebuilders._base.Node.__init__(self, None)
        self.element = element
        self.soup = soup

    def cloneNode(self):
        raise NotImplementedError
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/builder/_htmlparser.py
"""Use the HTMLParser library to parse HTML files that aren't too bad."""

__all__ = [
    'HTMLParserTreeBuilder',
    ]

from HTMLParser import (
    HTMLParser,
    HTMLParseError,
    )
import sys
import warnings

# Starting in Python 3.2, the HTMLParser constructor takes a 'strict'
# argument, which we'd like to set to False. Unfortunately,
# http://bugs.python.org/issue13273 makes strict=True a better bet
# before Python 3.2.3.
#
# At the end of this file, we monkeypatch HTMLParser so that
# strict=True works well on Python 3.2.2.
major, minor, release = sys.version_info[:3]
CONSTRUCTOR_TAKES_STRICT = (
    major > 3
    or (major == 3 and minor > 2)
    or (major == 3 and minor == 2 and release >= 3))

from bs4.element import (
    CData,
    Comment,
    Declaration,
    Doctype,
    ProcessingInstruction,
    )
from bs4.dammit import EntitySubstitution, UnicodeDammit

from bs4.builder import (
    HTML,
    HTMLTreeBuilder,
    STRICT,
    )


HTMLPARSER = 'html.parser'

class BeautifulSoupHTMLParser(HTMLParser):
    def handle_starttag(self, name, attrs):
        # XXX namespace
        self.soup.handle_starttag(name, None, None, dict(attrs))

    def handle_endtag(self, name):
        self.soup.handle_endtag(name)

    def handle_data(self, data):
        self.soup.handle_data(data)

    def handle_charref(self, name):
        # XXX workaround for a bug in HTMLParser. Remove this once
        # it's fixed.
        if name.startswith('x'):
            real_name = int(name.lstrip('x'), 16)
        else:
            real_name = int(name)

        try:
            data = unichr(real_name)
        except (ValueError, OverflowError), e:
            data = u"\N{REPLACEMENT CHARACTER}"

        self.handle_data(data)

    def handle_entityref(self, name):
        character = EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
        if character is not None:
            data = character
        else:
            data = "&%s;" % name
        self.handle_data(data)

    def handle_comment(self, data):
        self.soup.endData()
        self.soup.handle_data(data)
        self.soup.endData(Comment)

    def handle_decl(self, data):
        self.soup.endData()
        if data.startswith("DOCTYPE "):
            data = data[len("DOCTYPE "):]
        self.soup.handle_data(data)
        self.soup.endData(Doctype)

    def unknown_decl(self, data):
        if data.upper().startswith('CDATA['):
            cls = CData
            data = data[len('CDATA['):]
        else:
            cls = Declaration
        self.soup.endData()
        self.soup.handle_data(data)
        self.soup.endData(cls)

    def handle_pi(self, data):
        self.soup.endData()
        if data.endswith("?") and data.lower().startswith("xml"):
            # "An XHTML processing instruction using the trailing '?'
            # will cause the '?' to be included in data." - HTMLParser
            # docs.
            #
            # Strip the question mark so we don't end up with two
            # question marks.
            data = data[:-1]
        self.soup.handle_data(data)
        self.soup.endData(ProcessingInstruction)


class HTMLParserTreeBuilder(HTMLTreeBuilder):

    is_xml = False
    features = [HTML, STRICT, HTMLPARSER]

    def __init__(self, *args, **kwargs):
        if CONSTRUCTOR_TAKES_STRICT:
            kwargs['strict'] = False
        self.parser_args = (args, kwargs)

    def prepare_markup(self, markup, user_specified_encoding=None,
                       document_declared_encoding=None):
        """
        :return: A 4-tuple (markup, original encoding, encoding
        declared within markup, whether any characters had to be
        replaced with REPLACEMENT CHARACTER).
        """
        if isinstance(markup, unicode):
            return markup, None, None, False

        try_encodings = [user_specified_encoding, document_declared_encoding]
        dammit = UnicodeDammit(markup, try_encodings, is_html=True)
        return (dammit.markup, dammit.original_encoding,
                dammit.declared_html_encoding,
                dammit.contains_replacement_characters)

    def feed(self, markup):
        args, kwargs = self.parser_args
        parser = BeautifulSoupHTMLParser(*args, **kwargs)
        parser.soup = self.soup
        try:
            parser.feed(markup)
        except HTMLParseError, e:
            warnings.warn(RuntimeWarning(
                "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
            raise e

# Patch 3.2 versions of HTMLParser earlier than 3.2.3 to use some
# 3.2.3 code. This ensures they don't treat markup like <a href="..."> as a
# string.
#
# XXX This code can be removed once most Python 3 users are on 3.2.3.
if major == 3 and minor == 2 and not CONSTRUCTOR_TAKES_STRICT:
    import re
    attrfind_tolerant = re.compile(
        r'\s*((?<=[\'"\s])[^\s/>][^\s/=>]*)(\s*=+\s*'
        r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?')
    HTMLParserTreeBuilder.attrfind_tolerant = attrfind_tolerant

    locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
    BeautifulSoupHTMLParser.locatestarttagend = locatestarttagend

    from html.parser import tagfind, attrfind

    def parse_starttag(self, i):
        self.__starttag_text = None
        endpos = self.check_for_whole_start_tag(i)
        if endpos < 0:
            return endpos
        rawdata = self.rawdata
        self.__starttag_text = rawdata[i:endpos]

        # Now parse the data between i+1 and j into a tag and attrs
        attrs = []
        match = tagfind.match(rawdata, i+1)
        assert match, 'unexpected call to parse_starttag()'
        k = match.end()
        self.lasttag = tag = rawdata[i+1:k].lower()
        while k < endpos:
            if self.strict:
                m = attrfind.match(rawdata, k)
            else:
                m = attrfind_tolerant.match(rawdata, k)
            if not m:
                break
            attrname, rest, attrvalue = m.group(1, 2, 3)
            if not rest:
                attrvalue = None
            elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                 attrvalue[:1] == '"' == attrvalue[-1:]:
                attrvalue = attrvalue[1:-1]
            if attrvalue:
                attrvalue = self.unescape(attrvalue)
            attrs.append((attrname.lower(), attrvalue))
            k = m.end()

        end = rawdata[k:endpos].strip()
        if end not in (">", "/>"):
            lineno, offset = self.getpos()
            if "\n" in self.__starttag_text:
                lineno = lineno + self.__starttag_text.count("\n")
                offset = len(self.__starttag_text) \
                         - self.__starttag_text.rfind("\n")
            else:
                offset = offset + len(self.__starttag_text)
            if self.strict:
                self.error("junk characters in start tag: %r"
                           % (rawdata[k:endpos][:20],))
            self.handle_data(rawdata[i:endpos])
            return endpos
        if end.endswith('/>'):
            # XHTML-style empty tag: <span attr="value" />
            self.handle_startendtag(tag, attrs)
        else:
            self.handle_starttag(tag, attrs)
            if tag in self.CDATA_CONTENT_ELEMENTS:
                self.set_cdata_mode(tag)
        return endpos

    def set_cdata_mode(self, elem):
        self.cdata_elem = elem.lower()
        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)

    BeautifulSoupHTMLParser.parse_starttag = parse_starttag
    BeautifulSoupHTMLParser.set_cdata_mode = set_cdata_mode

    CONSTRUCTOR_TAKES_STRICT = True
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/builder/_lxml.py
__all__ = [
    'LXMLTreeBuilderForXML',
    'LXMLTreeBuilder',
    ]

from StringIO import StringIO
import collections
from lxml import etree
from bs4.element import Comment, Doctype, NamespacedAttribute
from bs4.builder import (
    FAST,
    HTML,
    HTMLTreeBuilder,
    PERMISSIVE,
    TreeBuilder,
    XML)
from bs4.dammit import UnicodeDammit

LXML = 'lxml'

class LXMLTreeBuilderForXML(TreeBuilder):
    DEFAULT_PARSER_CLASS = etree.XMLParser

    is_xml = True

    # Well, it's permissive by XML parser standards.
    features = [LXML, XML, FAST, PERMISSIVE]

    CHUNK_SIZE = 512

    @property
    def default_parser(self):
        # This can either return a parser object or a class, which
        # will be instantiated with default arguments.
        return etree.XMLParser(target=self, strip_cdata=False, recover=True)

    def __init__(self, parser=None, empty_element_tags=None):
        if empty_element_tags is not None:
            self.empty_element_tags = set(empty_element_tags)
        if parser is None:
            # Use the default parser.
            parser = self.default_parser
        if isinstance(parser, collections.Callable):
            # Instantiate the parser with default arguments
            parser = parser(target=self, strip_cdata=False)
        self.parser = parser
        self.soup = None
        self.nsmaps = None

    def _getNsTag(self, tag):
        # Split the namespace URL out of a fully-qualified lxml tag
        # name. Copied from lxml's src/lxml/sax.py.
        if tag[0] == '{':
            return tuple(tag[1:].split('}', 1))
        else:
            return (None, tag)

    def prepare_markup(self, markup, user_specified_encoding=None,
                       document_declared_encoding=None):
        """
        :return: A 3-tuple (markup, original encoding, encoding
        declared within markup).
        """
        if isinstance(markup, unicode):
            return markup, None, None, False

        try_encodings = [user_specified_encoding, document_declared_encoding]
        dammit = UnicodeDammit(markup, try_encodings, is_html=True)
        return (dammit.markup, dammit.original_encoding,
                dammit.declared_html_encoding,
                dammit.contains_replacement_characters)

    def feed(self, markup):
        if isinstance(markup, basestring):
            markup = StringIO(markup)
        # Call feed() at least once, even if the markup is empty,
        # or the parser won't be initialized.
        data = markup.read(self.CHUNK_SIZE)
        self.parser.feed(data)
        while data != '':
            # Now call feed() on the rest of the data, chunk by chunk.
            data = markup.read(self.CHUNK_SIZE)
            if data != '':
                self.parser.feed(data)
        self.parser.close()

    def close(self):
        self.nsmaps = None

    def start(self, name, attrs, nsmap={}):
        # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy.
        attrs = dict(attrs)

        nsprefix = None
        # Invert each namespace map as it comes in.
        if len(nsmap) == 0 and self.nsmaps != None:
            # There are no new namespaces for this tag, but namespaces
            # are in play, so we need a separate tag stack to know
            # when they end.
            self.nsmaps.append(None)
        elif len(nsmap) > 0:
            # A new namespace mapping has come into play.
            if self.nsmaps is None:
                self.nsmaps = []
            inverted_nsmap = dict((value, key) for key, value in nsmap.items())
            self.nsmaps.append(inverted_nsmap)
            # Also treat the namespace mapping as a set of attributes on the
            # tag, so we can recreate it later.
            attrs = attrs.copy()
            for prefix, namespace in nsmap.items():
                attribute = NamespacedAttribute(
                    "xmlns", prefix, "http://www.w3.org/2000/xmlns/")
                attrs[attribute] = namespace
        namespace, name = self._getNsTag(name)
        if namespace is not None:
            for inverted_nsmap in reversed(self.nsmaps):
                if inverted_nsmap is not None and namespace in inverted_nsmap:
                    nsprefix = inverted_nsmap[namespace]
                    break
        self.soup.handle_starttag(name, namespace, nsprefix, attrs)

    def end(self, name):
        self.soup.endData()
        completed_tag = self.soup.tagStack[-1]
        namespace, name = self._getNsTag(name)
        nsprefix = None
        if namespace is not None:
            for inverted_nsmap in reversed(self.nsmaps):
                if inverted_nsmap is not None and namespace in inverted_nsmap:
                    nsprefix = inverted_nsmap[namespace]
                    break
        self.soup.handle_endtag(name, nsprefix)
        if self.nsmaps != None:
            # This tag, or one of its parents, introduced a namespace
            # mapping, so pop it off the stack.
            self.nsmaps.pop()
            if len(self.nsmaps) == 0:
                # Namespaces are no longer in play, so don't bother keeping
                # track of the namespace stack.
                self.nsmaps = None

    def pi(self, target, data):
        pass

    def data(self, content):
        self.soup.handle_data(content)

    def doctype(self, name, pubid, system):
        self.soup.endData()
        doctype = Doctype.for_name_and_ids(name, pubid, system)
        self.soup.object_was_parsed(doctype)

    def comment(self, content):
        "Handle comments as Comment objects."
        self.soup.endData()
        self.soup.handle_data(content)
        self.soup.endData(Comment)

    def test_fragment_to_document(self, fragment):
        """See `TreeBuilder`."""
        return u'<?xml version="1.0" encoding="utf-8"?>\n%s' % fragment


class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):

    features = [LXML, HTML, FAST, PERMISSIVE]
    is_xml = False

    @property
    def default_parser(self):
        return etree.HTMLParser

    def feed(self, markup):
        self.parser.feed(markup)
        self.parser.close()

    def test_fragment_to_document(self, fragment):
        """See `TreeBuilder`."""
        return u'<html><body>%s</body></html>' % fragment
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/builder/__init__.py
from collections import defaultdict
import itertools
import sys
from bs4.element import (
    CharsetMetaAttributeValue,
    ContentMetaAttributeValue,
    whitespace_re
    )

__all__ = [
    'HTMLTreeBuilder',
    'SAXTreeBuilder',
    'TreeBuilder',
    'TreeBuilderRegistry',
    ]

# Some useful features for a TreeBuilder to have.
FAST = 'fast'
PERMISSIVE = 'permissive'
STRICT = 'strict'
XML = 'xml'
HTML = 'html'
HTML_5 = 'html5'


class TreeBuilderRegistry(object):

    def __init__(self):
        self.builders_for_feature = defaultdict(list)
        self.builders = []

    def register(self, treebuilder_class):
        """Register a treebuilder based on its advertised features."""
        for feature in treebuilder_class.features:
            self.builders_for_feature[feature].insert(0, treebuilder_class)
        self.builders.insert(0, treebuilder_class)

    def lookup(self, *features):
        if len(self.builders) == 0:
            # There are no builders at all.
            return None

        if len(features) == 0:
            # They didn't ask for any features. Give them the most
            # recently registered builder.
            return self.builders[0]

        # Go down the list of features in order, and eliminate any builders
        # that don't match every feature.
        features = list(features)
        features.reverse()
        candidates = None
        candidate_set = None
        while len(features) > 0:
            feature = features.pop()
            we_have_the_feature = self.builders_for_feature.get(feature, [])
            if len(we_have_the_feature) > 0:
                if candidates is None:
                    candidates = we_have_the_feature
                    candidate_set = set(candidates)
                else:
                    # Eliminate any candidates that don't have this feature.
                    candidate_set = candidate_set.intersection(
                        set(we_have_the_feature))

        # The only valid candidates are the ones in candidate_set.
        # Go through the original list of candidates and pick the first one
        # that's in candidate_set.
        if candidate_set is None:
            return None
        for candidate in candidates:
            if candidate in candidate_set:
                return candidate
        return None

# The BeautifulSoup class will take feature lists from developers and use them
# to look up builders in this registry.
builder_registry = TreeBuilderRegistry()

class TreeBuilder(object):
    """Turn a document into a Beautiful Soup object tree."""

    features = []

    is_xml = False
    preserve_whitespace_tags = set()
    empty_element_tags = None # A tag will be considered an empty-element
                              # tag when and only when it has no contents.

    # A value for these tag/attribute combinations is a space- or
    # comma-separated list of CDATA, rather than a single CDATA.
    cdata_list_attributes = {}


    def __init__(self):
        self.soup = None

    def reset(self):
        pass

    def can_be_empty_element(self, tag_name):
        """Might a tag with this name be an empty-element tag?

        The final markup may or may not actually present this tag as
        self-closing.

        For instance: an HTMLBuilder does not consider a <p> tag to be
        an empty-element tag (it's not in
        HTMLBuilder.empty_element_tags). This means an empty <p> tag
        will be presented as "<p></p>", not "<p />".

        The default implementation has no opinion about which tags are
        empty-element tags, so a tag will be presented as an
        empty-element tag if and only if it has no contents.
        "<foo></foo>" will become "<foo />", and "<foo>bar</foo>" will
        be left alone.
        """
        if self.empty_element_tags is None:
            return True
        return tag_name in self.empty_element_tags

    def feed(self, markup):
        raise NotImplementedError()

    def prepare_markup(self, markup, user_specified_encoding=None,
                       document_declared_encoding=None):
        return markup, None, None, False

    def test_fragment_to_document(self, fragment):
        """Wrap an HTML fragment to make it look like a document.

        Different parsers do this differently. For instance, lxml
        introduces an empty <head> tag, and html5lib
        doesn't. Abstracting this away lets us write simple tests
        which run HTML fragments through the parser and compare the
        results against other HTML fragments.

        This method should not be used outside of tests.
        """
        return fragment

    def set_up_substitutions(self, tag):
        return False

    def _replace_cdata_list_attribute_values(self, tag_name, attrs):
        """Replaces class="foo bar" with class=["foo", "bar"]

        Modifies its input in place.
        """
        if self.cdata_list_attributes:
            universal = self.cdata_list_attributes.get('*', [])
            tag_specific = self.cdata_list_attributes.get(
                tag_name.lower(), [])
            for cdata_list_attr in itertools.chain(universal, tag_specific):
                if cdata_list_attr in dict(attrs):
                    # Basically, we have a "class" attribute whose
                    # value is a whitespace-separated list of CSS
                    # classes. Split it into a list.
                    value = attrs[cdata_list_attr]
                    values = whitespace_re.split(value)
                    attrs[cdata_list_attr] = values
        return attrs

class SAXTreeBuilder(TreeBuilder):
    """A Beautiful Soup treebuilder that listens for SAX events."""

    def feed(self, markup):
        raise NotImplementedError()

    def close(self):
        pass

    def startElement(self, name, attrs):
        attrs = dict((key[1], value) for key, value in list(attrs.items()))
        #print "Start %s, %r" % (name, attrs)
        self.soup.handle_starttag(name, attrs)

    def endElement(self, name):
        #print "End %s" % name
        self.soup.handle_endtag(name)

    def startElementNS(self, nsTuple, nodeName, attrs):
        # Throw away (ns, nodeName) for now.
        self.startElement(nodeName, attrs)

    def endElementNS(self, nsTuple, nodeName):
        # Throw away (ns, nodeName) for now.
        self.endElement(nodeName)
        #handler.endElementNS((ns, node.nodeName), node.nodeName)

    def startPrefixMapping(self, prefix, nodeValue):
        # Ignore the prefix for now.
        pass

    def endPrefixMapping(self, prefix):
        # Ignore the prefix for now.
        # handler.endPrefixMapping(prefix)
        pass

    def characters(self, content):
        self.soup.handle_data(content)

    def startDocument(self):
        pass

    def endDocument(self):
        pass


class HTMLTreeBuilder(TreeBuilder):
    """This TreeBuilder knows facts about HTML.

    Such as which tags are empty-element tags.
    """

    preserve_whitespace_tags = set(['pre', 'textarea'])
    empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta',
                              'spacer', 'link', 'frame', 'base'])

    # The HTML standard defines these attributes as containing a
    # space-separated list of values, not a single value. That is,
    # class="foo bar" means that the 'class' attribute has two values,
    # 'foo' and 'bar', not the single value 'foo bar'. When we
    # encounter one of these attributes, we will parse its value into
    # a list of values if possible. Upon output, the list will be
    # converted back into a string.
    cdata_list_attributes = {
        "*" : ['class', 'accesskey', 'dropzone'],
        "a" : ['rel', 'rev'],
        "link" : ['rel', 'rev'],
        "td" : ["headers"],
        "th" : ["headers"],
        "td" : ["headers"],
        "form" : ["accept-charset"],
        "object" : ["archive"],

        # These are HTML5 specific, as are *.accesskey and *.dropzone above.
        "area" : ["rel"],
        "icon" : ["sizes"],
        "iframe" : ["sandbox"],
        "output" : ["for"],
        }

    def set_up_substitutions(self, tag):
        # We are only interested in <meta> tags
        if tag.name != 'meta':
            return False

        http_equiv = tag.get('http-equiv')
        content = tag.get('content')
        charset = tag.get('charset')

        # We are interested in <meta> tags that say what encoding the
        # document was originally in. This means HTML 5-style <meta>
        # tags that provide the "charset" attribute. It also means
        # HTML 4-style <meta> tags that provide the "content"
        # attribute and have "http-equiv" set to "content-type".
        #
        # In both cases we will replace the value of the appropriate
        # attribute with a standin object that can take on any
        # encoding.
        meta_encoding = None
        if charset is not None:
            # HTML 5 style:
            # <meta charset="utf8">
            meta_encoding = charset
            tag['charset'] = CharsetMetaAttributeValue(charset)

        elif (content is not None and http_equiv is not None
              and http_equiv.lower() == 'content-type'):
            # HTML 4 style:
            # <meta http-equiv="content-type" content="text/html; charset=utf8">
            tag['content'] = ContentMetaAttributeValue(content)

        return (meta_encoding is not None)

def register_treebuilders_from(module):
    """Copy TreeBuilders from the given module into this module."""
    # I'm fairly sure this is not the best way to do this.
    this_module = sys.modules['bs4.builder']
    for name in module.__all__:
        obj = getattr(module, name)

        if issubclass(obj, TreeBuilder):
            setattr(this_module, name, obj)
            this_module.__all__.append(name)
            # Register the builder while we're at it.
            this_module.builder_registry.register(obj)

# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last result.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
    from . import _html5lib
    register_treebuilders_from(_html5lib)
except ImportError:
    # They don't have html5lib installed.
    pass
try:
    from . import _lxml
    register_treebuilders_from(_lxml)
except ImportError:
    # They don't have lxml installed.
    pass
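The registry assembled above is what the BeautifulSoup constructor consults when it is given a parser name or feature. A minimal usage sketch, assuming bs4 and lxml are installed (the output in the comments is indicative, not guaranteed):

from bs4 import BeautifulSoup
from bs4.builder import builder_registry

# Resolve a builder class by feature, the same way BeautifulSoup does
# internally for BeautifulSoup(markup, 'lxml').
builder_class = builder_registry.lookup('lxml', 'html')
print(builder_class)  # e.g. <class 'bs4.builder._lxml.LXMLTreeBuilder'>

# cdata_list_attributes (defined in HTMLTreeBuilder above) is why a
# multi-valued attribute such as class comes back as a list.
soup = BeautifulSoup('<p class="foo bar">hi</p>', 'lxml')
print(soup.p['class'])  # ['foo', 'bar']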
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/dammit.py
# -*- coding: utf-8 -*-
“””Beautiful Soup bonus library: Unicode, Dammit
This class forces XML data into a standard format (usually to UTF-8 or
Unicode). It is heavily based on code from Mark Pilgrim’s Universal
Feed Parser. It does not rewrite the XML or HTML to reflect a new
encoding; that’s the tree builder’s job.
“””
import codecs
from htmlentitydefs import codepoint2name
import re
import warnings
# Autodetects character encodings. Very useful.
# Download from http://chardet.feedparser.org/
# or ‘apt-get install python-chardet’
# or ‘easy_install chardet’
try:
import chardet
#import chardet.constants
#chardet.constants._debug = 1
except ImportError:
chardet = None
# Available from http://cjkpython.i18n.org/.
try:
import iconv_codec
except ImportError:
pass
xml_encoding_re = re.compile(
‘^<\?.*encoding=[\'"](.*?)[\'"].*\?>‘.encode(), re.I)
html_meta_re = re.compile(
‘<\s*meta[^>]+charset\s*=\s*[“\’]?([^>]*?)[ /;\'”>]’.encode(), re.I)
class EntitySubstitution(object):
“””Substitute XML or HTML entities for the corresponding characters.”””
def _populate_class_variables():
lookup = {}
reverse_lookup = {}
characters_for_re = []
for codepoint, name in list(codepoint2name.items()):
character = unichr(codepoint)
if codepoint != 34:
# There’s no point in turning the quotation mark into
# ", unless it happens within an attribute value, which
# is handled elsewhere.
characters_for_re.append(character)
lookup[character] = name
# But we do want to turn " into the quotation mark.
reverse_lookup[name] = character
re_definition = “[%s]” % “”.join(characters_for_re)
return lookup, reverse_lookup, re.compile(re_definition)
(CHARACTER_TO_HTML_ENTITY, HTML_ENTITY_TO_CHARACTER,
CHARACTER_TO_HTML_ENTITY_RE) = _populate_class_variables()
CHARACTER_TO_XML_ENTITY = {
“‘”: “apos”,
‘”‘: “quot”,
“&”: “amp”,
“<": "lt",
">“: “gt”,
}
BARE_AMPERSAND_OR_BRACKET = re.compile(“([<>]|”
“&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)”
“)”)
@classmethod
def _substitute_html_entity(cls, matchobj):
entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0))
return “&%s;” % entity
@classmethod
def _substitute_xml_entity(cls, matchobj):
“””Used with a regular expression to substitute the
appropriate XML entity for an XML special character.”””
entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
return “&%s;” % entity
@classmethod
def quoted_attribute_value(self, value):
“””Make a value into a quoted XML attribute, possibly escaping it.
Most strings will be quoted using double quotes.
Bob’s Bar -> “Bob’s Bar”
If a string contains double quotes, it will be quoted using
single quotes.
Welcome to “my bar” -> ‘Welcome to “my bar”‘
If a string contains both single and double quotes, the
double quotes will be escaped, and the string will be quoted
using double quotes.
Welcome to “Bob’s Bar” -> “Welcome to "Bob’s bar"
“””
quote_with = ‘”‘
if ‘”‘ in value:
if “‘” in value:
# The string contains both single and double
# quotes. Turn the double quotes into
# entities. We quote the double quotes rather than
# the single quotes because the entity name is
# “"” whether this is HTML or XML. If we
# quoted the single quotes, we’d have to decide
# between ' and &squot;.
replace_with = “"”
value = value.replace(‘”‘, replace_with)
else:
# There are double quotes but no single quotes.
# We can use single quotes to quote the attribute.
quote_with = “‘”
return quote_with + value + quote_with
@classmethod
def substitute_xml(cls, value, make_quoted_attribute=False):
“””Substitute XML entities for special XML characters.
:param value: A string to be substituted. The less-than sign will
become <, the greater-than sign will become >, and any
ampersands that are not part of an entity defition will
become &.
:param make_quoted_attribute: If True, then the string will be
quoted, as befits an attribute value.
“””
# Escape angle brackets, and ampersands that aren’t part of
# entities.
value = cls.BARE_AMPERSAND_OR_BRACKET.sub(
cls._substitute_xml_entity, value)
if make_quoted_attribute:
value = cls.quoted_attribute_value(value)
return value
@classmethod
def substitute_html(cls, s):
“””Replace certain Unicode characters with named HTML entities.
This differs from data.encode(encoding, ‘xmlcharrefreplace’)
in that the goal is to make the result more readable (to those
with ASCII displays) rather than to recover from
errors. There’s absolutely nothing wrong with a UTF-8 string
containg a LATIN SMALL LETTER E WITH ACUTE, but replacing that
character with “é” will make it more readable to some
people.
“””
return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(
cls._substitute_html_entity, s)
class UnicodeDammit:
“””A class for detecting the encoding of a *ML document and
converting it to a Unicode string. If the source encoding is
windows-1252, can replace MS smart quotes with their HTML or XML
equivalents.”””
# This dictionary maps commonly seen values for “charset” in HTML
# meta tags to the corresponding Python codec names. It only covers
# values that aren’t in Python’s aliases and can’t be determined
# by the heuristics in find_codec.
CHARSET_ALIASES = {“macintosh”: “mac-roman”,
“x-sjis”: “shift-jis”}
ENCODINGS_WITH_SMART_QUOTES = [
“windows-1252”,
“iso-8859-1”,
“iso-8859-2″,
]
def __init__(self, markup, override_encodings=[],
smart_quotes_to=None, is_html=False):
self.declared_html_encoding = None
self.smart_quotes_to = smart_quotes_to
self.tried_encodings = []
self.contains_replacement_characters = False
if markup == ” or isinstance(markup, unicode):
self.markup = markup
self.unicode_markup = unicode(markup)
self.original_encoding = None
return
new_markup, document_encoding, sniffed_encoding = \
self._detectEncoding(markup, is_html)
self.markup = new_markup
u = None
if new_markup != markup:
# _detectEncoding modified the markup, then converted it to
# Unicode and then to UTF-8. So convert it from UTF-8.
u = self._convert_from(“utf8”)
self.original_encoding = sniffed_encoding
if not u:
for proposed_encoding in (
override_encodings + [document_encoding, sniffed_encoding]):
if proposed_encoding is not None:
u = self._convert_from(proposed_encoding)
if u:
break
# If no luck and we have auto-detection library, try that:
if not u and chardet and not isinstance(self.markup, unicode):
u = self._convert_from(chardet.detect(self.markup)[‘encoding’])
# As a last resort, try utf-8 and windows-1252:
if not u:
for proposed_encoding in (“utf-8”, “windows-1252”):
u = self._convert_from(proposed_encoding)
if u:
break
# As an absolute last resort, try the encodings again with
# character replacement.
if not u:
for proposed_encoding in (
override_encodings + [
document_encoding, sniffed_encoding, “utf-8”, “windows-1252”]):
if proposed_encoding != “ascii”:
u = self._convert_from(proposed_encoding, “replace”)
if u is not None:
warnings.warn(
UnicodeWarning(
“Some characters could not be decoded, and were ”
“replaced with REPLACEMENT CHARACTER.”))
self.contains_replacement_characters = True
break
# We could at this point force it to ASCII, but that would
# destroy so much data that I think giving up is better
self.unicode_markup = u
if not u:
self.original_encoding = None
def _sub_ms_char(self, match):
“””Changes a MS smart quote character to an XML or HTML
entity, or an ASCII character.”””
orig = match.group(1)
if self.smart_quotes_to == ‘ascii’:
sub = self.MS_CHARS_TO_ASCII.get(orig).encode()
else:
sub = self.MS_CHARS.get(orig)
if type(sub) == tuple:
if self.smart_quotes_to == ‘xml’:
sub = ‘&#x’.encode() + sub[1].encode() + ‘;’.encode()
else:
sub = ‘&’.encode() + sub[0].encode() + ‘;’.encode()
else:
sub = sub.encode()
return sub
def _convert_from(self, proposed, errors=”strict”):
proposed = self.find_codec(proposed)
if not proposed or (proposed, errors) in self.tried_encodings:
return None
self.tried_encodings.append((proposed, errors))
markup = self.markup
# Convert smart quotes to HTML if coming from an encoding
# that might have them.
if (self.smart_quotes_to is not None
and proposed.lower() in self.ENCODINGS_WITH_SMART_QUOTES):
smart_quotes_re = b”([\x80-\x9f])”
smart_quotes_compiled = re.compile(smart_quotes_re)
markup = smart_quotes_compiled.sub(self._sub_ms_char, markup)
try:
#print “Trying to convert document to %s (errors=%s)” % (
# proposed, errors)
u = self._to_unicode(markup, proposed, errors)
self.markup = u
self.original_encoding = proposed
except Exception as e:
#print “That didn’t work!”
#print e
return None
#print “Correct encoding: %s” % proposed
return self.markup
def _to_unicode(self, data, encoding, errors=”strict”):
”’Given a string and its encoding, decodes the string into Unicode.
%encoding is a string recognized by encodings.aliases”’
# strip Byte Order Mark (if present)
if (len(data) >= 4) and (data[:2] == ‘\xfe\xff’) \
and (data[2:4] != ‘\x00\x00’):
encoding = ‘utf-16be’
data = data[2:]
elif (len(data) >= 4) and (data[:2] == ‘\xff\xfe’) \
and (data[2:4] != ‘\x00\x00’):
encoding = ‘utf-16le’
data = data[2:]
elif data[:3] == ‘\xef\xbb\xbf’:
encoding = ‘utf-8’
data = data[3:]
elif data[:4] == ‘\x00\x00\xfe\xff’:
encoding = ‘utf-32be’
data = data[4:]
elif data[:4] == ‘\xff\xfe\x00\x00’:
encoding = ‘utf-32le’
data = data[4:]
newdata = unicode(data, encoding, errors)
return newdata
def _detectEncoding(self, xml_data, is_html=False):
“””Given a document, tries to detect its XML encoding.”””
xml_encoding = sniffed_xml_encoding = None
try:
if xml_data[:4] == b’\x4c\x6f\xa7\x94′:
# EBCDIC
xml_data = self._ebcdic_to_ascii(xml_data)
elif xml_data[:4] == b’\x00\x3c\x00\x3f’:
# UTF-16BE
sniffed_xml_encoding = ‘utf-16be’
xml_data = unicode(xml_data, ‘utf-16be’).encode(‘utf-8′)
elif (len(xml_data) >= 4) and (xml_data[:2] == b’\xfe\xff’) \
and (xml_data[2:4] != b’\x00\x00′):
# UTF-16BE with BOM
sniffed_xml_encoding = ‘utf-16be’
xml_data = unicode(xml_data[2:], ‘utf-16be’).encode(‘utf-8′)
elif xml_data[:4] == b’\x3c\x00\x3f\x00’:
# UTF-16LE
sniffed_xml_encoding = ‘utf-16le’
xml_data = unicode(xml_data, ‘utf-16le’).encode(‘utf-8′)
elif (len(xml_data) >= 4) and (xml_data[:2] == b’\xff\xfe’) and \
(xml_data[2:4] != b’\x00\x00′):
# UTF-16LE with BOM
sniffed_xml_encoding = ‘utf-16le’
xml_data = unicode(xml_data[2:], ‘utf-16le’).encode(‘utf-8′)
elif xml_data[:4] == b’\x00\x00\x00\x3c’:
# UTF-32BE
sniffed_xml_encoding = ‘utf-32be’
xml_data = unicode(xml_data, ‘utf-32be’).encode(‘utf-8′)
elif xml_data[:4] == b’\x3c\x00\x00\x00’:
# UTF-32LE
sniffed_xml_encoding = ‘utf-32le’
xml_data = unicode(xml_data, ‘utf-32le’).encode(‘utf-8′)
elif xml_data[:4] == b’\x00\x00\xfe\xff’:
# UTF-32BE with BOM
sniffed_xml_encoding = ‘utf-32be’
xml_data = unicode(xml_data[4:], ‘utf-32be’).encode(‘utf-8′)
elif xml_data[:4] == b’\xff\xfe\x00\x00’:
# UTF-32LE with BOM
sniffed_xml_encoding = ‘utf-32le’
xml_data = unicode(xml_data[4:], ‘utf-32le’).encode(‘utf-8′)
elif xml_data[:3] == b’\xef\xbb\xbf’:
# UTF-8 with BOM
sniffed_xml_encoding = ‘utf-8’
xml_data = unicode(xml_data[3:], ‘utf-8’).encode(‘utf-8’)
else:
sniffed_xml_encoding = ‘ascii’
pass
except:
xml_encoding_match = None
xml_encoding_match = xml_encoding_re.match(xml_data)
if not xml_encoding_match and is_html:
xml_encoding_match = html_meta_re.search(xml_data)
if xml_encoding_match is not None:
xml_encoding = xml_encoding_match.groups()[0].decode(
‘ascii’).lower()
if is_html:
self.declared_html_encoding = xml_encoding
if sniffed_xml_encoding and \
(xml_encoding in (‘iso-10646-ucs-2’, ‘ucs-2’, ‘csunicode’,
‘iso-10646-ucs-4’, ‘ucs-4’, ‘csucs4’,
‘utf-16’, ‘utf-32’, ‘utf_16’, ‘utf_32’,
‘utf16’, ‘u16′)):
xml_encoding = sniffed_xml_encoding
return xml_data, xml_encoding, sniffed_xml_encoding
def find_codec(self, charset):
return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
or (charset and self._codec(charset.replace(“-“, “”))) \
or (charset and self._codec(charset.replace(“-“, “_”))) \
or charset
def _codec(self, charset):
if not charset:
return charset
codec = None
try:
codecs.lookup(charset)
codec = charset
except (LookupError, ValueError):
pass
return codec
EBCDIC_TO_ASCII_MAP = None
def _ebcdic_to_ascii(self, s):
c = self.__class__
if not c.EBCDIC_TO_ASCII_MAP:
emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
201,202,106,107,108,109,110,111,112,113,114,203,204,205,
206,207,208,209,126,115,116,117,118,119,120,121,122,210,
211,212,213,214,215,216,217,218,219,220,221,222,223,224,
225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
250,251,252,253,254,255)
import string
c.EBCDIC_TO_ASCII_MAP = string.maketrans(
”.join(map(chr, list(range(256)))), ”.join(map(chr, emap)))
return s.translate(c.EBCDIC_TO_ASCII_MAP)
# A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
MS_CHARS = {b’\x80’: (‘euro’, ’20AC’),
b’\x81′: ‘ ‘,
b’\x82’: (‘sbquo’, ‘201A’),
b’\x83′: (‘fnof’, ‘192’),
b’\x84′: (‘bdquo’, ‘201E’),
b’\x85′: (‘hellip’, ‘2026’),
b’\x86′: (‘dagger’, ‘2020’),
b’\x87′: (‘Dagger’, ‘2021’),
b’\x88′: (‘circ’, ‘2C6′),
b’\x89’: (‘permil’, ‘2030’),
b’\x8A’: (‘Scaron’, ‘160’),
b’\x8B’: (‘lsaquo’, ‘2039’),
b’\x8C’: (‘OElig’, ‘152’),
b’\x8D’: ‘?’,
b’\x8E’: (‘#x17D’, ’17D’),
b’\x8F’: ‘?’,
b’\x90′: ‘?’,
b’\x91′: (‘lsquo’, ‘2018’),
b’\x92′: (‘rsquo’, ‘2019’),
b’\x93′: (‘ldquo’, ‘201C’),
b’\x94′: (‘rdquo’, ‘201D’),
b’\x95′: (‘bull’, ‘2022’),
b’\x96′: (‘ndash’, ‘2013’),
b’\x97′: (‘mdash’, ‘2014’),
b’\x98′: (’tilde’, ‘2DC’),
b’\x99′: (‘trade’, ‘2122’),
b’\x9a’: (‘scaron’, ‘161’),
b’\x9b’: (‘rsaquo’, ‘203A’),
b’\x9c’: (‘oelig’, ‘153’),
b’\x9d’: ‘?’,
b’\x9e’: (‘#x17E’, ’17E’),
b’\x9f’: (‘Yuml’, ”),}
# A parochial partial mapping of ISO-Latin-1 to ASCII. Contains
# horrors like stripping diacritical marks to turn á into a, but also
# contains non-horrors like turning “ into “.
MS_CHARS_TO_ASCII = {
b'\x80' : 'EUR',
b'\x81' : ' ',
b'\x82' : ',',
b'\x83' : 'f',
b'\x84' : ',,',
b'\x85' : '...',
b'\x86' : '+',
b'\x87' : '++',
b'\x88' : '^',
b'\x89' : '%',
b'\x8a' : 'S',
b'\x8b' : '<',
b'\x8c' : 'OE',
b'\x8d' : '?',
b'\x8e' : 'Z',
b'\x8f' : '?',
b'\x90' : '?',
b'\x91' : "'",
b'\x92' : "'",
b'\x93' : '"',
b'\x94' : '"',
b'\x95' : '*',
b'\x96' : '-',
b'\x97' : '--',
b'\x98' : '~',
b'\x99' : '(TM)',
b'\x9a' : 's',
b'\x9b' : '>',
b'\x9c' : 'oe',
b'\x9d' : '?',
b'\x9e' : 'z',
b'\x9f' : 'Y',
b'\xa0' : ' ',
b'\xa1' : '!',
b'\xa2' : 'c',
b'\xa3' : 'GBP',
b'\xa4' : '$', # This approximation is especially parochial--this is the
# generic currency symbol.
b'\xa5' : 'YEN',
b'\xa6' : '|',
b'\xa7' : 'S',
b'\xa8' : '..',
b'\xa9' : '',
b'\xaa' : '(th)',
b'\xab' : '<<',
b'\xac' : '!',
b'\xad' : ' ',
b'\xae' : '(R)',
b'\xaf' : '-',
b'\xb0' : 'o',
b'\xb1' : '+-',
b'\xb2' : '2',
b'\xb3' : '3',
b'\xb4' : ("'", 'acute'),
b'\xb5' : 'u',
b'\xb6' : 'P',
b'\xb7' : '*',
b'\xb8' : ',',
b'\xb9' : '1',
b'\xba' : '(th)',
b'\xbb' : '>>',
b'\xbc' : '1/4',
b'\xbd' : '1/2',
b'\xbe' : '3/4',
b'\xbf' : '?',
b'\xc0' : 'A',
b'\xc1' : 'A',
b'\xc2' : 'A',
b'\xc3' : 'A',
b'\xc4' : 'A',
b'\xc5' : 'A',
b'\xc6' : 'AE',
b'\xc7' : 'C',
b'\xc8' : 'E',
b'\xc9' : 'E',
b'\xca' : 'E',
b'\xcb' : 'E',
b'\xcc' : 'I',
b'\xcd' : 'I',
b'\xce' : 'I',
b'\xcf' : 'I',
b'\xd0' : 'D',
b'\xd1' : 'N',
b'\xd2' : 'O',
b'\xd3' : 'O',
b'\xd4' : 'O',
b'\xd5' : 'O',
b'\xd6' : 'O',
b'\xd7' : '*',
b'\xd8' : 'O',
b'\xd9' : 'U',
b'\xda' : 'U',
b'\xdb' : 'U',
b'\xdc' : 'U',
b'\xdd' : 'Y',
b'\xde' : 'b',
b'\xdf' : 'B',
b'\xe0' : 'a',
b'\xe1' : 'a',
b'\xe2' : 'a',
b'\xe3' : 'a',
b'\xe4' : 'a',
b'\xe5' : 'a',
b'\xe6' : 'ae',
b'\xe7' : 'c',
b'\xe8' : 'e',
b'\xe9' : 'e',
b'\xea' : 'e',
b'\xeb' : 'e',
b'\xec' : 'i',
b'\xed' : 'i',
b'\xee' : 'i',
b'\xef' : 'i',
b'\xf0' : 'o',
b'\xf1' : 'n',
b'\xf2' : 'o',
b'\xf3' : 'o',
b'\xf4' : 'o',
b'\xf5' : 'o',
b'\xf6' : 'o',
b'\xf7' : '/',
b'\xf8' : 'o',
b'\xf9' : 'u',
b'\xfa' : 'u',
b'\xfb' : 'u',
b'\xfc' : 'u',
b'\xfd' : 'y',
b'\xfe' : 'b',
b'\xff' : 'y',
}
# A map used when removing rogue Windows-1252/ISO-8859-1
# characters in otherwise UTF-8 documents.
#
# Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in
# Windows-1252.
WINDOWS_1252_TO_UTF8 = {
0x80 : b'\xe2\x82\xac', # €
0x82 : b'\xe2\x80\x9a', # ‚
0x83 : b'\xc6\x92', # ƒ
0x84 : b'\xe2\x80\x9e', # „
0x85 : b'\xe2\x80\xa6', # …
0x86 : b'\xe2\x80\xa0', # †
0x87 : b'\xe2\x80\xa1', # ‡
0x88 : b'\xcb\x86', # ˆ
0x89 : b'\xe2\x80\xb0', # ‰
0x8a : b'\xc5\xa0', # Š
0x8b : b'\xe2\x80\xb9', # ‹
0x8c : b'\xc5\x92', # Œ
0x8e : b'\xc5\xbd', # Ž
0x91 : b'\xe2\x80\x98', # ‘
0x92 : b'\xe2\x80\x99', # ’
0x93 : b'\xe2\x80\x9c', # “
0x94 : b'\xe2\x80\x9d', # ”
0x95 : b'\xe2\x80\xa2', # •
0x96 : b'\xe2\x80\x93', # –
0x97 : b'\xe2\x80\x94', # —
0x98 : b'\xcb\x9c', # ˜
0x99 : b'\xe2\x84\xa2', # ™
0x9a : b'\xc5\xa1', # š
0x9b : b'\xe2\x80\xba', # ›
0x9c : b'\xc5\x93', # œ
0x9e : b'\xc5\xbe', # ž
0x9f : b'\xc5\xb8', # Ÿ
0xa0 : b'\xc2\xa0', # no-break space
0xa1 : b'\xc2\xa1', # ¡
0xa2 : b'\xc2\xa2', # ¢
0xa3 : b'\xc2\xa3', # £
0xa4 : b'\xc2\xa4', # ¤
0xa5 : b'\xc2\xa5', # ¥
0xa6 : b'\xc2\xa6', # ¦
0xa7 : b'\xc2\xa7', # §
0xa8 : b'\xc2\xa8', # ¨
0xa9 : b'\xc2\xa9', # ©
0xaa : b'\xc2\xaa', # ª
0xab : b'\xc2\xab', # «
0xac : b'\xc2\xac', # ¬
0xad : b'\xc2\xad', # soft hyphen
0xae : b'\xc2\xae', # ®
0xaf : b'\xc2\xaf', # ¯
0xb0 : b'\xc2\xb0', # °
0xb1 : b'\xc2\xb1', # ±
0xb2 : b'\xc2\xb2', # ²
0xb3 : b'\xc2\xb3', # ³
0xb4 : b'\xc2\xb4', # ´
0xb5 : b'\xc2\xb5', # µ
0xb6 : b'\xc2\xb6', # ¶
0xb7 : b'\xc2\xb7', # ·
0xb8 : b'\xc2\xb8', # ¸
0xb9 : b'\xc2\xb9', # ¹
0xba : b'\xc2\xba', # º
0xbb : b'\xc2\xbb', # »
0xbc : b'\xc2\xbc', # ¼
0xbd : b'\xc2\xbd', # ½
0xbe : b'\xc2\xbe', # ¾
0xbf : b'\xc2\xbf', # ¿
0xc0 : b'\xc3\x80', # À
0xc1 : b'\xc3\x81', # Á
0xc2 : b'\xc3\x82', # Â
0xc3 : b'\xc3\x83', # Ã
0xc4 : b'\xc3\x84', # Ä
0xc5 : b'\xc3\x85', # Å
0xc6 : b'\xc3\x86', # Æ
0xc7 : b'\xc3\x87', # Ç
0xc8 : b'\xc3\x88', # È
0xc9 : b'\xc3\x89', # É
0xca : b'\xc3\x8a', # Ê
0xcb : b'\xc3\x8b', # Ë
0xcc : b'\xc3\x8c', # Ì
0xcd : b'\xc3\x8d', # Í
0xce : b'\xc3\x8e', # Î
0xcf : b'\xc3\x8f', # Ï
0xd0 : b'\xc3\x90', # Ð
0xd1 : b'\xc3\x91', # Ñ
0xd2 : b'\xc3\x92', # Ò
0xd3 : b'\xc3\x93', # Ó
0xd4 : b'\xc3\x94', # Ô
0xd5 : b'\xc3\x95', # Õ
0xd6 : b'\xc3\x96', # Ö
0xd7 : b'\xc3\x97', # ×
0xd8 : b'\xc3\x98', # Ø
0xd9 : b'\xc3\x99', # Ù
0xda : b'\xc3\x9a', # Ú
0xdb : b'\xc3\x9b', # Û
0xdc : b'\xc3\x9c', # Ü
0xdd : b'\xc3\x9d', # Ý
0xde : b'\xc3\x9e', # Þ
0xdf : b'\xc3\x9f', # ß
0xe0 : b'\xc3\xa0', # à
0xe1 : b'\xc3\xa1', # á
0xe2 : b'\xc3\xa2', # â
0xe3 : b'\xc3\xa3', # ã
0xe4 : b'\xc3\xa4', # ä
0xe5 : b'\xc3\xa5', # å
0xe6 : b'\xc3\xa6', # æ
0xe7 : b'\xc3\xa7', # ç
0xe8 : b'\xc3\xa8', # è
0xe9 : b'\xc3\xa9', # é
0xea : b'\xc3\xaa', # ê
0xeb : b'\xc3\xab', # ë
0xec : b'\xc3\xac', # ì
0xed : b'\xc3\xad', # í
0xee : b'\xc3\xae', # î
0xef : b'\xc3\xaf', # ï
0xf0 : b'\xc3\xb0', # ð
0xf1 : b'\xc3\xb1', # ñ
0xf2 : b'\xc3\xb2', # ò
0xf3 : b'\xc3\xb3', # ó
0xf4 : b'\xc3\xb4', # ô
0xf5 : b'\xc3\xb5', # õ
0xf6 : b'\xc3\xb6', # ö
0xf7 : b'\xc3\xb7', # ÷
0xf8 : b'\xc3\xb8', # ø
0xf9 : b'\xc3\xb9', # ù
0xfa : b'\xc3\xba', # ú
0xfb : b'\xc3\xbb', # û
0xfc : b'\xc3\xbc', # ü
0xfd : b'\xc3\xbd', # ý
0xfe : b'\xc3\xbe', # þ
0xff : b'\xc3\xbf', # ÿ
}
MULTIBYTE_MARKERS_AND_SIZES = [
(0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF
(0xe0, 0xef, 3), # 3-byte characters start with E0-EF
(0xf0, 0xf4, 4), # 4-byte characters start with F0-F4
]
FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]
LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
@classmethod
def detwingle(cls, in_bytes, main_encoding="utf8",
embedded_encoding="windows-1252"):
"""Fix characters from one encoding embedded in some other encoding.
Currently the only situation supported is Windows-1252 (or its
subset ISO-8859-1), embedded in UTF-8.
The input must be a bytestring. If you've already converted
the document to Unicode, you're too late.
The output is a bytestring in which `embedded_encoding`
characters have been converted to their `main_encoding`
equivalents.
"""
if embedded_encoding.replace('_', '-').lower() not in (
'windows-1252', 'windows_1252'):
raise NotImplementedError(
"Windows-1252 and ISO-8859-1 are the only currently supported "
"embedded encodings.")
if main_encoding.lower() not in ('utf8', 'utf-8'):
raise NotImplementedError(
"UTF-8 is the only currently supported main encoding.")
byte_chunks = []
chunk_start = 0
pos = 0
while pos < len(in_bytes):
byte = in_bytes[pos]
if not isinstance(byte, int):
# Python 2.x
byte = ord(byte)
if (byte >= cls.FIRST_MULTIBYTE_MARKER
and byte <= cls.LAST_MULTIBYTE_MARKER):
# This is the start of a UTF-8 multibyte character. Skip
# to the end.
for start, end, size in cls.MULTIBYTE_MARKERS_AND_SIZES:
if byte >= start and byte <= end:
pos += size
break
elif byte >= 0x80 and byte in cls.WINDOWS_1252_TO_UTF8:
# We found a Windows-1252 character!
# Save the string up to this point as a chunk.
byte_chunks.append(in_bytes[chunk_start:pos])
# Now translate the Windows-1252 character into UTF-8
# and add it as another, one-byte chunk.
byte_chunks.append(cls.WINDOWS_1252_TO_UTF8[byte])
pos += 1
chunk_start = pos
else:
# Go on to the next character.
pos += 1
if chunk_start == 0:
# The string is unchanged.
return in_bytes
else:
# Store the final chunk.
byte_chunks.append(in_bytes[chunk_start:])
return b''.join(byte_chunks)
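# A minimal usage sketch for detwingle(), assuming this classmethod is
# reachable as bs4.UnicodeDammit.detwingle (as it is in bs4 4.x):
#
# >>> from bs4 import UnicodeDammit
# >>> snowmen = (u"\N{SNOWMAN}" * 3).encode("utf8")
# >>> quote = b"\x93Windows-1252 quotes\x94"  # invalid as UTF-8
# >>> fixed = UnicodeDammit.detwingle(snowmen + quote)
# >>> fixed.decode("utf8")  # decodes cleanly; \x93/\x94 became curly quotes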
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/element.py
import collections
import re
import sys
import warnings
from bs4.dammit import EntitySubstitution
DEFAULT_OUTPUT_ENCODING = “utf-8”
PY3K = (sys.version_info[0] > 2)
whitespace_re = re.compile(r"\s+")
def _alias(attr):
"""Alias one attribute name to another for backward compatibility"""
@property
def alias(self):
return getattr(self, attr)
@alias.setter
def alias(self, value):
return setattr(self, attr, value)
return alias
class NamespacedAttribute(unicode):
def __new__(cls, prefix, name, namespace=None):
if name is None:
obj = unicode.__new__(cls, prefix)
else:
obj = unicode.__new__(cls, prefix + ":" + name)
obj.prefix = prefix
obj.name = name
obj.namespace = namespace
return obj
class AttributeValueWithCharsetSubstitution(unicode):
"""A stand-in object for a character encoding specified in HTML."""
class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
"""A generic stand-in for the value of a meta tag's 'charset' attribute.
When Beautiful Soup parses the markup '<meta charset="utf8">', the
value of the 'charset' attribute will be one of these objects.
"""
def __new__(cls, original_value):
obj = unicode.__new__(cls, original_value)
obj.original_value = original_value
return obj
def encode(self, encoding):
return encoding
class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
"""A generic stand-in for the value of a meta tag's 'content' attribute.
When Beautiful Soup parses the markup:
<meta http-equiv="content-type" content="text/html; charset=utf8">
the value of the 'content' attribute will be one of these objects.
"""
CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
def __new__(cls, original_value):
match = cls.CHARSET_RE.search(original_value)
if match is None:
# No substitution necessary.
return unicode.__new__(unicode, original_value)
obj = unicode.__new__(cls, original_value)
obj.original_value = original_value
return obj
def encode(self, encoding):
def rewrite(match):
return match.group(1) + encoding
return self.CHARSET_RE.sub(rewrite, self.original_value)
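# A doctest-style sketch of the substitution above: the original charset
# survives parsing, but encode() rewrites it to the target encoding.
# >>> val = ContentMetaAttributeValue("text/html; charset=x-sjis")
# >>> val.encode("utf8")
# u'text/html; charset=utf8'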
class PageElement(object):
"""Contains the navigational information for some part of the page
(either a tag or a piece of text)"""
# There are four possible values for the "formatter" argument passed in
# to methods like encode() and prettify():
#
# "html" - All Unicode characters with corresponding HTML entities
# are converted to those entities on output.
# "minimal" - Bare ampersands and angle brackets are converted to
# XML entities: &amp; &lt; &gt;
# None - The null formatter. Unicode characters are never
# converted to entities. This is not recommended, but it's
# faster than "minimal".
# A function - This function will be called on every string that
# needs to undergo entity substitution
FORMATTERS = {
"html" : EntitySubstitution.substitute_html,
"minimal" : EntitySubstitution.substitute_xml,
None : None
}
@classmethod
def format_string(cls, s, formatter='minimal'):
"""Format the given string using the given formatter."""
if not callable(formatter):
formatter = cls.FORMATTERS.get(
formatter, EntitySubstitution.substitute_xml)
if formatter is None:
output = s
else:
output = formatter(s)
return output
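# A doctest-style sketch of the formatter choices handled above:
# >>> PageElement.format_string(u'AT&T <sigh>', 'minimal')
# u'AT&amp;T &lt;sigh&gt;'
# >>> PageElement.format_string(u'AT&T <sigh>', None)  # null formatter
# u'AT&T <sigh>'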
def setup(self, parent=None, previous_element=None):
"""Sets up the initial relations between this element and
other elements."""
self.parent = parent
self.previous_element = previous_element
if previous_element is not None:
self.previous_element.next_element = self
self.next_element = None
self.previous_sibling = None
self.next_sibling = None
if self.parent is not None and self.parent.contents:
self.previous_sibling = self.parent.contents[-1]
self.previous_sibling.next_sibling = self
nextSibling = _alias("next_sibling") # BS3
previousSibling = _alias("previous_sibling") # BS3
def replace_with(self, replace_with):
if replace_with is self:
return
if replace_with is self.parent:
raise ValueError("Cannot replace a Tag with its parent.")
old_parent = self.parent
my_index = self.parent.index(self)
self.extract()
old_parent.insert(my_index, replace_with)
return self
replaceWith = replace_with # BS3
def unwrap(self):
my_parent = self.parent
my_index = self.parent.index(self)
self.extract()
for child in reversed(self.contents[:]):
my_parent.insert(my_index, child)
return self
replace_with_children = unwrap
replaceWithChildren = unwrap # BS3
def wrap(self, wrap_inside):
me = self.replace_with(wrap_inside)
wrap_inside.append(me)
return wrap_inside
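# A doctest-style sketch of the tree-surgery helpers above, assuming
# BeautifulSoup.new_tag() from the bs4 4.x API:
# >>> from bs4 import BeautifulSoup
# >>> soup = BeautifulSoup('<p><b>bold</b> text</p>')
# >>> b = soup.b.unwrap()  # returns the extracted, now-childless <b>
# >>> soup.p
# <p>bold text</p>
# >>> soup.p.wrap(soup.new_tag('div'))
# <div><p>bold text</p></div>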
def extract(self):
"""Destructively rips this element out of the tree."""
if self.parent is not None:
del self.parent.contents[self.parent.index(self)]
#Find the two elements that would be next to each other if
#this element (and any children) hadn't been parsed. Connect
#the two.
last_child = self._last_descendant()
next_element = last_child.next_element
if self.previous_element is not None:
self.previous_element.next_element = next_element
if next_element is not None:
next_element.previous_element = self.previous_element
self.previous_element = None
last_child.next_element = None
self.parent = None
if self.previous_sibling is not None:
self.previous_sibling.next_sibling = self.next_sibling
if self.next_sibling is not None:
self.next_sibling.previous_sibling = self.previous_sibling
self.previous_sibling = self.next_sibling = None
return self
def _last_descendant(self):
"Finds the last element beneath this object to be parsed."
last_child = self
while hasattr(last_child, 'contents') and last_child.contents:
last_child = last_child.contents[-1]
return last_child
# BS3: Not part of the API!
_lastRecursiveChild = _last_descendant
def insert(self, position, new_child):
if new_child is self:
raise ValueError("Cannot insert a tag into itself.")
if (isinstance(new_child, basestring)
and not isinstance(new_child, NavigableString)):
new_child = NavigableString(new_child)
position = min(position, len(self.contents))
if hasattr(new_child, 'parent') and new_child.parent is not None:
# We're 'inserting' an element that's already one
# of this object's children.
if new_child.parent is self:
current_index = self.index(new_child)
if current_index < position:
# We're moving this element further down the list
# of this object's children. That means that when
# we extract this element, our target index will
# jump down one.
position -= 1
new_child.extract()
new_child.parent = self
previous_child = None
if position == 0:
new_child.previous_sibling = None
new_child.previous_element = self
else:
previous_child = self.contents[position - 1]
new_child.previous_sibling = previous_child
new_child.previous_sibling.next_sibling = new_child
new_child.previous_element = previous_child._last_descendant()
if new_child.previous_element is not None:
new_child.previous_element.next_element = new_child
new_childs_last_element = new_child._last_descendant()
if position >= len(self.contents):
new_child.next_sibling = None
parent = self
parents_next_sibling = None
while parents_next_sibling is None and parent is not None:
parents_next_sibling = parent.next_sibling
parent = parent.parent
if parents_next_sibling is not None:
# We found the element that comes next in the document.
break
if parents_next_sibling is not None:
new_childs_last_element.next_element = parents_next_sibling
else:
# The last element of this tag is the last element in
# the document.
new_childs_last_element.next_element = None
else:
next_child = self.contents[position]
new_child.next_sibling = next_child
if new_child.next_sibling is not None:
new_child.next_sibling.previous_sibling = new_child
new_childs_last_element.next_element = next_child
if new_childs_last_element.next_element is not None:
new_childs_last_element.next_element.previous_element = new_childs_last_element
self.contents.insert(position, new_child)
def append(self, tag):
"""Appends the given tag to the contents of this tag."""
self.insert(len(self.contents), tag)
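# A doctest-style sketch of insert() and append(); plain strings are
# promoted to NavigableStrings as shown above:
# >>> from bs4 import BeautifulSoup
# >>> soup = BeautifulSoup('<p>one</p>')
# >>> soup.p.append(' two')
# >>> soup.p.insert(0, 'zero ')
# >>> soup.p.contents
# [u'zero ', u'one', u' two']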
def insert_before(self, predecessor):
"""Makes the given element the immediate predecessor of this one.
The two elements will have the same parent, and the given element
will be immediately before this one.
"""
if self is predecessor:
raise ValueError("Can't insert an element before itself.")
parent = self.parent
if parent is None:
raise ValueError(
"Element has no parent, so 'before' has no meaning.")
# Extract first so that the index won't be screwed up if they
# are siblings.
if isinstance(predecessor, PageElement):
predecessor.extract()
index = parent.index(self)
parent.insert(index, predecessor)
def insert_after(self, successor):
"""Makes the given element the immediate successor of this one.
The two elements will have the same parent, and the given element
will be immediately after this one.
"""
if self is successor:
raise ValueError("Can't insert an element after itself.")
parent = self.parent
if parent is None:
raise ValueError(
"Element has no parent, so 'after' has no meaning.")
# Extract first so that the index won't be screwed up if they
# are siblings.
if isinstance(successor, PageElement):
successor.extract()
index = parent.index(self)
parent.insert(index+1, successor)
def find_next(self, name=None, attrs={}, text=None, **kwargs):
"""Returns the first item that matches the given criteria and
appears after this Tag in the document."""
return self._find_one(self.find_all_next, name, attrs, text, **kwargs)
findNext = find_next # BS3
def find_all_next(self, name=None, attrs={}, text=None, limit=None,
**kwargs):
"""Returns all items that match the given criteria and appear
after this Tag in the document."""
return self._find_all(name, attrs, text, limit, self.next_elements,
**kwargs)
findAllNext = find_all_next # BS3
def find_next_sibling(self, name=None, attrs={}, text=None, **kwargs):
"""Returns the closest sibling to this Tag that matches the
given criteria and appears after this Tag in the document."""
return self._find_one(self.find_next_siblings, name, attrs, text,
**kwargs)
findNextSibling = find_next_sibling # BS3
def find_next_siblings(self, name=None, attrs={}, text=None, limit=None,
**kwargs):
"""Returns the siblings of this Tag that match the given
criteria and appear after this Tag in the document."""
return self._find_all(name, attrs, text, limit,
self.next_siblings, **kwargs)
findNextSiblings = find_next_siblings # BS3
fetchNextSiblings = find_next_siblings # BS2
def find_previous(self, name=None, attrs={}, text=None, **kwargs):
"""Returns the first item that matches the given criteria and
appears before this Tag in the document."""
return self._find_one(
self.find_all_previous, name, attrs, text, **kwargs)
findPrevious = find_previous # BS3
def find_all_previous(self, name=None, attrs={}, text=None, limit=None,
**kwargs):
"""Returns all items that match the given criteria and appear
before this Tag in the document."""
return self._find_all(name, attrs, text, limit, self.previous_elements,
**kwargs)
findAllPrevious = find_all_previous # BS3
fetchPrevious = find_all_previous # BS2
def find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs):
"""Returns the closest sibling to this Tag that matches the
given criteria and appears before this Tag in the document."""
return self._find_one(self.find_previous_siblings, name, attrs, text,
**kwargs)
findPreviousSibling = find_previous_sibling # BS3
def find_previous_siblings(self, name=None, attrs={}, text=None,
limit=None, **kwargs):
"""Returns the siblings of this Tag that match the given
criteria and appear before this Tag in the document."""
return self._find_all(name, attrs, text, limit,
self.previous_siblings, **kwargs)
findPreviousSiblings = find_previous_siblings # BS3
fetchPreviousSiblings = find_previous_siblings # BS2
def find_parent(self, name=None, attrs={}, **kwargs):
"""Returns the closest parent of this Tag that matches the given
criteria."""
# NOTE: We can't use _find_one because findParents takes a different
# set of arguments.
r = None
l = self.find_parents(name, attrs, 1)
if l:
r = l[0]
return r
findParent = find_parent # BS3
def find_parents(self, name=None, attrs={}, limit=None, **kwargs):
"""Returns the parents of this Tag that match the given
criteria."""
return self._find_all(name, attrs, None, limit, self.parents,
**kwargs)
findParents = find_parents # BS3
fetchParents = find_parents # BS2
@property
def next(self):
return self.next_element
@property
def previous(self):
return self.previous_element
#These methods do the real heavy lifting.
def _find_one(self, method, name, attrs, text, **kwargs):
r = None
l = method(name, attrs, text, 1, **kwargs)
if l:
r = l[0]
return r
def _find_all(self, name, attrs, text, limit, generator, **kwargs):
"Iterates over a generator looking for things that match."
if isinstance(name, SoupStrainer):
strainer = name
elif text is None and not limit and not attrs and not kwargs:
# Optimization to find all tags.
if name is True or name is None:
return [element for element in generator
if isinstance(element, Tag)]
# Optimization to find all tags with a given name.
elif isinstance(name, basestring):
return [element for element in generator
if isinstance(element, Tag) and element.name == name]
else:
strainer = SoupStrainer(name, attrs, text, **kwargs)
else:
# Build a SoupStrainer
strainer = SoupStrainer(name, attrs, text, **kwargs)
results = ResultSet(strainer)
while True:
try:
i = next(generator)
except StopIteration:
break
if i:
found = strainer.search(i)
if found:
results.append(found)
if limit and len(results) >= limit:
break
return results
#These generators can be used to navigate starting from both
#NavigableStrings and Tags.
@property
def next_elements(self):
i = self.next_element
while i is not None:
yield i
i = i.next_element
@property
def next_siblings(self):
i = self.next_sibling
while i is not None:
yield i
i = i.next_sibling
@property
def previous_elements(self):
i = self.previous_element
while i is not None:
yield i
i = i.previous_element
@property
def previous_siblings(self):
i = self.previous_sibling
while i is not None:
yield i
i = i.previous_sibling
@property
def parents(self):
i = self.parent
while i is not None:
yield i
i = i.parent
# Methods for supporting CSS selectors.
tag_name_re = re.compile('^[a-z0-9]+$')
# /^(\w+)\[(\w+)([=~\|\^\$\*]?)=?"?([^\]"]*)"?\]$/
#   \---/  \---/\-------------/    \-------/
#     |      |         |               |
#     |      |         |           The value
#     |      |    ~,|,^,$,* or =
#     |   Attribute
#    Tag
attribselect_re = re.compile(
r'^(?P<tag>\w+)?\[(?P<attribute>\w+)(?P<operator>[=~\|\^\$\*]?)' +
r'=?"?(?P<value>[^\]"]*)"?\]$'
)
def _attr_value_as_string(self, value, default=None):
"""Force an attribute value into a string representation.
A multi-valued attribute will be converted into a
space-separated string.
"""
value = self.get(value, default)
if isinstance(value, list) or isinstance(value, tuple):
value = " ".join(value)
return value
def _attribute_checker(self, operator, attribute, value=''):
"""Create a function that performs a CSS selector operation.
Takes an operator, attribute and optional value. Returns a
function that will return True for elements that match that
combination.
"""
if operator == '=':
# string representation of `attribute` is equal to `value`
return lambda el: el._attr_value_as_string(attribute) == value
elif operator == '~':
# space-separated list representation of `attribute`
# contains `value`
def _includes_value(element):
attribute_value = element.get(attribute, [])
if not isinstance(attribute_value, list):
attribute_value = attribute_value.split()
return value in attribute_value
return _includes_value
elif operator == '^':
# string representation of `attribute` starts with `value`
return lambda el: el._attr_value_as_string(
attribute, '').startswith(value)
elif operator == '$':
# string representation of `attribute` ends with `value`
return lambda el: el._attr_value_as_string(
attribute, '').endswith(value)
elif operator == '*':
# string representation of `attribute` contains `value`
return lambda el: value in el._attr_value_as_string(attribute, '')
elif operator == '|':
# string representation of `attribute` is either exactly
# `value` or starts with `value` and then a dash.
def _is_or_starts_with_dash(element):
attribute_value = element._attr_value_as_string(attribute, '')
return (attribute_value == value or attribute_value.startswith(
value + '-'))
return _is_or_starts_with_dash
else:
return lambda el: el.has_attr(attribute)
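# A doctest-style sketch of the checker functions built above:
# >>> from bs4 import BeautifulSoup
# >>> el = BeautifulSoup('<a href="http://example.com/index.html">x</a>').a
# >>> el._attribute_checker('^', 'href', 'http:')(el)
# True
# >>> el._attribute_checker('$', 'href', '.html')(el)
# True
# >>> el._attribute_checker('*', 'href', 'example')(el)
# True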
def select(self, selector):
"""Perform a CSS selection operation on the current element."""
tokens = selector.split()
current_context = [self]
for index, token in enumerate(tokens):
if tokens[index - 1] == '>':
# already found direct descendants in last step. skip this
# step.
continue
m = self.attribselect_re.match(token)
if m is not None:
# Attribute selector
tag, attribute, operator, value = m.groups()
if not tag:
tag = True
checker = self._attribute_checker(operator, attribute, value)
found = []
for context in current_context:
found.extend(
[el for el in context.find_all(tag) if checker(el)])
current_context = found
continue
if '#' in token:
# ID selector
tag, id = token.split('#', 1)
if tag == "":
tag = True
el = current_context[0].find(tag, {'id': id})
if el is None:
return [] # No match
current_context = [el]
continue
if '.' in token:
# Class selector
tag_name, klass = token.split('.', 1)
if not tag_name:
tag_name = True
classes = set(klass.split('.'))
found = []
def classes_match(tag):
if tag_name is not True and tag.name != tag_name:
return False
if not tag.has_attr('class'):
return False
return classes.issubset(tag['class'])
for context in current_context:
found.extend(context.find_all(classes_match))
current_context = found
continue
if token == '*':
# Star selector
found = []
for context in current_context:
found.extend(context.find_all(True))
current_context = found
continue
if token == '>':
# Child selector
tag = tokens[index + 1]
if not tag:
tag = True
found = []
for context in current_context:
found.extend(context.find_all(tag, recursive=False))
current_context = found
continue
# Here we should just have a regular tag
if not self.tag_name_re.match(token):
return []
found = []
for context in current_context:
found.extend(context.find_all(token))
current_context = found
return current_context
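# A doctest-style sketch of the whitespace-tokenized grammar select()
# supports (tag, #id, .class, [attr=value], *, and >):
# >>> from bs4 import BeautifulSoup
# >>> soup = BeautifulSoup('<div id="main"><p class="story big">a</p><p>b</p></div>')
# >>> soup.select('p.story')
# [<p class="story big">a</p>]
# >>> len(soup.select('div p'))
# 2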
# Old non-property versions of the generators, for backwards
# compatibility with BS3.
def nextGenerator(self):
return self.next_elements
def nextSiblingGenerator(self):
return self.next_siblings
def previousGenerator(self):
return self.previous_elements
def previousSiblingGenerator(self):
return self.previous_siblings
def parentGenerator(self):
return self.parents
class NavigableString(unicode, PageElement):
PREFIX = ''
SUFFIX = ''
def __new__(cls, value):
"""Create a new NavigableString.
When unpickling a NavigableString, this method is called with
the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
passed in to the superclass's __new__ or the superclass won't know
how to handle non-ASCII characters.
"""
if isinstance(value, unicode):
return unicode.__new__(cls, value)
return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
def __getnewargs__(self):
return (unicode(self),)
def __getattr__(self, attr):
"""text.string gives you text. This is for backwards
compatibility for Navigable*String, but for CData* it lets you
get the string without the CData wrapper."""
if attr == 'string':
return self
else:
raise AttributeError(
"'%s' object has no attribute '%s'" % (
self.__class__.__name__, attr))
def output_ready(self, formatter="minimal"):
output = self.format_string(self, formatter)
return self.PREFIX + output + self.SUFFIX
class PreformattedString(NavigableString):
"""A NavigableString not subject to the normal formatting rules.
The string will be passed into the formatter (to trigger side effects),
but the return value will be ignored.
"""
def output_ready(self, formatter="minimal"):
"""CData strings are passed into the formatter.
But the return value is ignored."""
self.format_string(self, formatter)
return self.PREFIX + self + self.SUFFIX
class CData(PreformattedString):
PREFIX = u'<![CDATA['
SUFFIX = u']]>'
class ProcessingInstruction(PreformattedString):
PREFIX = u'<?'
SUFFIX = u'?>'
class Comment(PreformattedString):
PREFIX = u'<!--'
SUFFIX = u'-->'
class Declaration(PreformattedString):
PREFIX = u'<!'
SUFFIX = u'>'
class Doctype(PreformattedString):
@classmethod
def for_name_and_ids(cls, name, pub_id, system_id):
value = name
if pub_id is not None:
value += ' PUBLIC "%s"' % pub_id
if system_id is not None:
value += ' "%s"' % system_id
elif system_id is not None:
value += ' SYSTEM "%s"' % system_id
return Doctype(value)
PREFIX = u'<!DOCTYPE '
SUFFIX = u'>\n'
class Tag(PageElement):
"""Represents a found HTML tag with its attributes and contents."""
def __init__(self, parser=None, builder=None, name=None, namespace=None,
prefix=None, attrs=None, parent=None, previous=None):
"Basic constructor."
if parser is None:
self.parser_class = None
else:
# We don't actually store the parser object: that lets extracted
# chunks be garbage-collected.
self.parser_class = parser.__class__
if name is None:
raise ValueError("No value provided for new tag's name.")
self.name = name
self.namespace = namespace
self.prefix = prefix
if attrs is None:
attrs = {}
elif builder.cdata_list_attributes:
attrs = builder._replace_cdata_list_attribute_values(
self.name, attrs)
else:
attrs = dict(attrs)
self.attrs = attrs
self.contents = []
self.setup(parent, previous)
self.hidden = False
# Set up any substitutions, such as the charset in a META tag.
if builder is not None:
builder.set_up_substitutions(self)
self.can_be_empty_element = builder.can_be_empty_element(name)
else:
self.can_be_empty_element = False
parserClass = _alias("parser_class") # BS3
@property
def is_empty_element(self):
"""Is this tag an empty-element tag? (aka a self-closing tag)
A tag that has contents is never an empty-element tag.
A tag that has no contents may or may not be an empty-element
tag. It depends on the builder used to create the tag. If the
builder has a designated list of empty-element tags, then only
a tag whose name shows up in that list is considered an
empty-element tag.
If the builder has no designated list of empty-element tags,
then any tag with no contents is an empty-element tag.
"""
return len(self.contents) == 0 and self.can_be_empty_element
isSelfClosing = is_empty_element # BS3
@property
def string(self):
"""Convenience property to get the single string within this tag.
:Return: If this tag has a single string child, return value
is that string. If this tag has no children, or more than one
child, return value is None. If this tag has one child tag,
return value is the 'string' attribute of the child tag,
recursively.
"""
if len(self.contents) != 1:
return None
child = self.contents[0]
if isinstance(child, NavigableString):
return child
return child.string
@string.setter
def string(self, string):
self.clear()
self.append(string.__class__(string))
def _all_strings(self, strip=False):
"""Yield all child strings, possibly stripping them."""
for descendant in self.descendants:
if not isinstance(descendant, NavigableString):
continue
if strip:
descendant = descendant.strip()
if len(descendant) == 0:
continue
yield descendant
strings = property(_all_strings)
@property
def stripped_strings(self):
for string in self._all_strings(True):
yield string
def get_text(self, separator=””, strip=False):
"""
Get all child strings, concatenated using the given separator.
"""
return separator.join([s for s in self._all_strings(strip)])
getText = get_text
text = property(get_text)
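# A doctest-style sketch of the three string views defined above:
# >>> from bs4 import BeautifulSoup
# >>> soup = BeautifulSoup('<p> Hello, <b>world</b> </p>')
# >>> list(soup.p.stripped_strings)
# [u'Hello,', u'world']
# >>> soup.p.get_text('|', strip=True)
# u'Hello,|world'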
def decompose(self):
"""Recursively destroys the contents of this tree."""
self.extract()
i = self
while i is not None:
next = i.next_element
i.__dict__.clear()
i = next
def clear(self, decompose=False):
"""
Extract all children. If decompose is True, decompose instead.
"""
if decompose:
for element in self.contents[:]:
if isinstance(element, Tag):
element.decompose()
else:
element.extract()
else:
for element in self.contents[:]:
element.extract()
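# A doctest-style sketch contrasting extract(), decompose() and clear():
# extract() detaches a subtree and returns it, decompose() destroys it,
# clear() empties a tag in place.
# >>> from bs4 import BeautifulSoup
# >>> soup = BeautifulSoup('<div><p>gone</p><p>kept</p></div>')
# >>> detached = soup.p.extract()  # first <p> leaves the tree, still usable
# >>> soup.div
# <div><p>kept</p></div>
# >>> soup.div.clear()
# >>> soup.div
# <div></div>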
def index(self, element):
"""
Find the index of a child by identity, not value. Avoids issues with
tag.contents.index(element) getting the index of equal elements.
"""
for i, child in enumerate(self.contents):
if child is element:
return i
raise ValueError("Tag.index: element not in tag")
def get(self, key, default=None):
"""Returns the value of the 'key' attribute for the tag, or
the value given for 'default' if it doesn't have that
attribute."""
return self.attrs.get(key, default)
def has_attr(self, key):
return key in self.attrs
def __hash__(self):
return str(self).__hash__()
def __getitem__(self, key):
"""tag[key] returns the value of the 'key' attribute for the tag,
and throws an exception if it's not there."""
return self.attrs[key]
def __iter__(self):
"Iterating over a tag iterates over its contents."
return iter(self.contents)
def __len__(self):
"The length of a tag is the length of its list of contents."
return len(self.contents)
def __contains__(self, x):
return x in self.contents
def __nonzero__(self):
"A tag is non-None even if it has no contents."
return True
def __setitem__(self, key, value):
"""Setting tag[key] sets the value of the 'key' attribute for the
tag."""
self.attrs[key] = value
def __delitem__(self, key):
"Deleting tag[key] deletes all 'key' attributes for the tag."
self.attrs.pop(key, None)
def __call__(self, *args, **kwargs):
"""Calling a tag like a function is the same as calling its
find_all() method. Eg. tag('a') returns a list of all the A tags
found within this tag."""
return self.find_all(*args, **kwargs)
def __getattr__(self, tag):
#print "Getattr %s.%s" % (self.__class__, tag)
if len(tag) > 3 and tag.endswith('Tag'):
# BS3: soup.aTag -> soup.find("a")
tag_name = tag[:-3]
warnings.warn(
'.%sTag is deprecated, use .find("%s") instead.' % (
tag_name, tag_name))
return self.find(tag_name)
# We special case contents to avoid recursion.
elif not tag.startswith("__") and not tag == "contents":
return self.find(tag)
raise AttributeError(
"'%s' object has no attribute '%s'" % (self.__class__, tag))
def __eq__(self, other):
"""Returns true iff this tag has the same name, the same attributes,
and the same contents (recursively) as the given tag."""
if self is other:
return True
if (not hasattr(other, 'name') or
not hasattr(other, 'attrs') or
not hasattr(other, 'contents') or
self.name != other.name or
self.attrs != other.attrs or
len(self) != len(other)):
return False
for i, my_child in enumerate(self.contents):
if my_child != other.contents[i]:
return False
return True
def __ne__(self, other):
"""Returns true iff this tag is not identical to the other tag,
as defined in __eq__."""
return not self == other
def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
"""Renders this tag as a string."""
return self.encode(encoding)
def __unicode__(self):
return self.decode()
def __str__(self):
return self.encode()
if PY3K:
__str__ = __repr__ = __unicode__
def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
indent_level=None, formatter="minimal",
errors="xmlcharrefreplace"):
# Turn the data structure into Unicode, then encode the
# Unicode.
u = self.decode(indent_level, encoding, formatter)
return u.encode(encoding, errors)
def decode(self, indent_level=None,
eventual_encoding=DEFAULT_OUTPUT_ENCODING,
formatter="minimal"):
"""Returns a Unicode representation of this tag and its contents.
:param eventual_encoding: The tag is destined to be
encoded into this encoding. This method is _not_
responsible for performing that encoding. This information
is passed in so that it can be substituted in if the
document contains a <META> tag that mentions the document's
encoding.
"""
attrs = []
if self.attrs:
for key, val in sorted(self.attrs.items()):
if val is None:
decoded = key
else:
if isinstance(val, list) or isinstance(val, tuple):
val = ' '.join(val)
elif not isinstance(val, basestring):
val = str(val)
elif (
isinstance(val, AttributeValueWithCharsetSubstitution)
and eventual_encoding is not None):
val = val.encode(eventual_encoding)
text = self.format_string(val, formatter)
decoded = (
str(key) + '='
+ EntitySubstitution.quoted_attribute_value(text))
attrs.append(decoded)
close = ''
closeTag = ''
if self.is_empty_element:
close = '/'
else:
closeTag = '</%s>' % self.name
prefix = ''
if self.prefix:
prefix = self.prefix + ":"
pretty_print = (indent_level is not None)
if pretty_print:
space = (' ' * (indent_level - 1))
indent_contents = indent_level + 1
else:
space = ''
indent_contents = None
contents = self.decode_contents(
indent_contents, eventual_encoding, formatter)
if self.hidden:
# This is the 'document root' object.
s = contents
else:
s = []
attribute_string = ''
if attrs:
attribute_string = ' ' + ' '.join(attrs)
if pretty_print:
s.append(space)
s.append('<%s%s%s%s>' % (
prefix, self.name, attribute_string, close))
if pretty_print:
s.append("\n")
s.append(contents)
if pretty_print and contents and contents[-1] != "\n":
s.append("\n")
if pretty_print and closeTag:
s.append(space)
s.append(closeTag)
if pretty_print and closeTag and self.next_sibling:
s.append("\n")
s = ''.join(s)
return s
def prettify(self, encoding=None, formatter="minimal"):
if encoding is None:
return self.decode(True, formatter=formatter)
else:
return self.encode(encoding, True, formatter=formatter)
def decode_contents(self, indent_level=None,
eventual_encoding=DEFAULT_OUTPUT_ENCODING,
formatter="minimal"):
"""Renders the contents of this tag as a Unicode string.
:param eventual_encoding: The tag is destined to be
encoded into this encoding. This method is _not_
responsible for performing that encoding. This information
is passed in so that it can be substituted in if the
document contains a <META> tag that mentions the document's
encoding.
"""
pretty_print = (indent_level is not None)
s = []
for c in self:
text = None
if isinstance(c, NavigableString):
text = c.output_ready(formatter)
elif isinstance(c, Tag):
s.append(c.decode(indent_level, eventual_encoding,
formatter))
if text and indent_level:
text = text.strip()
if text:
if pretty_print:
s.append(" " * (indent_level - 1))
s.append(text)
if pretty_print:
s.append("\n")
return ''.join(s)
def encode_contents(
self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
formatter="minimal"):
"""Renders the contents of this tag as a bytestring."""
contents = self.decode_contents(indent_level, encoding, formatter)
return contents.encode(encoding)
# Old method for BS3 compatibility
def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
prettyPrint=False, indentLevel=0):
if not prettyPrint:
indentLevel = None
return self.encode_contents(
indent_level=indentLevel, encoding=encoding)
#Soup methods
def find(self, name=None, attrs={}, recursive=True, text=None,
**kwargs):
"""Return only the first child of this Tag matching the given
criteria."""
r = None
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
if l:
r = l[0]
return r
findChild = find
def find_all(self, name=None, attrs={}, recursive=True, text=None,
limit=None, **kwargs):
"""Extracts a list of Tag objects that match the given
criteria. You can specify the name of the Tag and any
attributes you want the Tag to have.
The value of a key-value pair in the 'attrs' map can be a
string, a list of strings, a regular expression object, or a
callable that takes a string and returns whether or not the
string matches for some custom definition of 'matches'. The
same is true of the tag name."""
generator = self.descendants
if not recursive:
generator = self.children
return self._find_all(name, attrs, text, limit, generator, **kwargs)
findAll = find_all # BS3
findChildren = find_all # BS2
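# A doctest-style sketch of the matcher types find_all() accepts, per the
# docstring above (string, list, regexp, callable, keyword attributes):
# >>> import re
# >>> from bs4 import BeautifulSoup
# >>> soup = BeautifulSoup('<a id="1">one</a><b id="2">two</b>')
# >>> [t.name for t in soup.find_all(['a', 'b'])]
# [u'a', u'b']
# >>> [t.name for t in soup.find_all(re.compile('^b'))]
# [u'b']
# >>> soup.find('b', id='2').string
# u'two'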
#Generator methods
@property
def children(self):
# return iter() to make the purpose of the method clear
return iter(self.contents) # XXX This seems to be untested.
@property
def descendants(self):
if not len(self.contents):
return
stopNode = self._last_descendant().next_element
current = self.contents[0]
while current is not stopNode:
yield current
current = current.next_element
# Old names for backwards compatibility
def childGenerator(self):
return self.children
def recursiveChildGenerator(self):
return self.descendants
# This was kind of misleading because has_key() (attributes) was
# different from __in__ (contents). has_key() is gone in Python 3,
# anyway.
has_key = has_attr
# Next, a couple classes to represent queries and their results.
class SoupStrainer(object):
"""Encapsulates a number of ways of matching a markup element (tag or
text)."""
def __init__(self, name=None, attrs={}, text=None, **kwargs):
self.name = self._normalize_search_value(name)
if not isinstance(attrs, dict):
# Treat a non-dict value for attrs as a search for the 'class'
# attribute.
kwargs['class'] = attrs
attrs = None
if kwargs:
if attrs:
attrs = attrs.copy()
attrs.update(kwargs)
else:
attrs = kwargs
normalized_attrs = {}
for key, value in attrs.items():
normalized_attrs[key] = self._normalize_search_value(value)
self.attrs = normalized_attrs
self.text = self._normalize_search_value(text)
def _normalize_search_value(self, value):
# Leave it alone if it's a Unicode string, a callable, a
# regular expression, a boolean, or None.
if (isinstance(value, unicode) or callable(value) or hasattr(value, 'match')
or isinstance(value, bool) or value is None):
return value
# If it's a bytestring, convert it to Unicode, treating it as UTF-8.
if isinstance(value, bytes):
return value.decode("utf8")
# If it's listlike, convert it into a list of strings.
if hasattr(value, '__iter__'):
new_value = []
for v in value:
if (hasattr(v, '__iter__') and not isinstance(v, bytes)
and not isinstance(v, unicode)):
# This is almost certainly the user's mistake. In the
# interests of avoiding infinite loops, we'll let
# it through as-is rather than doing a recursive call.
new_value.append(v)
else:
new_value.append(self._normalize_search_value(v))
return new_value
# Otherwise, convert it into a Unicode string.
# The unicode(str()) thing is so this will do the same thing on Python 2
# and Python 3.
return unicode(str(value))
def __str__(self):
if self.text:
return self.text
else:
return "%s|%s" % (self.name, self.attrs)
def search_tag(self, markup_name=None, markup_attrs={}):
found = None
markup = None
if isinstance(markup_name, Tag):
markup = markup_name
markup_attrs = markup
call_function_with_tag_data = (
isinstance(self.name, collections.Callable)
and not isinstance(markup_name, Tag))
if ((not self.name)
or call_function_with_tag_data
or (markup and self._matches(markup, self.name))
or (not markup and self._matches(markup_name, self.name))):
if call_function_with_tag_data:
match = self.name(markup_name, markup_attrs)
else:
match = True
markup_attr_map = None
for attr, match_against in list(self.attrs.items()):
if not markup_attr_map:
if hasattr(markup_attrs, 'get'):
markup_attr_map = markup_attrs
else:
markup_attr_map = {}
for k, v in markup_attrs:
markup_attr_map[k] = v
attr_value = markup_attr_map.get(attr)
if not self._matches(attr_value, match_against):
match = False
break
if match:
if markup:
found = markup
else:
found = markup_name
if found and self.text and not self._matches(found.string, self.text):
found = None
return found
searchTag = search_tag
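# A doctest-style sketch of a strainer driven directly through
# search_tag(); a match returns the matched name, a miss returns None:
# >>> strainer = SoupStrainer('b', id='2')
# >>> strainer.search_tag('b', {'id': '2'})
# 'b'
# >>> strainer.search_tag('b', {'id': '1'}) is None
# True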
def search(self, markup):
# print 'looking for %s in %s' % (self, markup)
found = None
# If given a list of items, scan it for a text element that
# matches.
if hasattr(markup, '__iter__') and not isinstance(markup, (Tag, basestring)):
for element in markup:
if isinstance(element, NavigableString) \
and self.search(element):
found = element
break
# If it's a Tag, make sure its name or attributes match.
# Don't bother with Tags if we're searching for text.
elif isinstance(markup, Tag):
if not self.text or self.name or self.attrs:
found = self.search_tag(markup)
# If it's text, make sure the text matches.
elif isinstance(markup, NavigableString) or \
isinstance(markup, basestring):
if not self.name and not self.attrs and self._matches(markup, self.text):
found = markup
else:
raise Exception(
"I don't know how to match against a %s" % markup.__class__)
return found
def _matches(self, markup, match_against):
# print u"Matching %s against %s" % (markup, match_against)
result = False
if isinstance(markup, list) or isinstance(markup, tuple):
# This should only happen when searching a multi-valued attribute
# like 'class'.
if (isinstance(match_against, unicode)
and ' ' in match_against):
# A bit of a special case. If they try to match "foo
# bar" on a multivalue attribute's value, only accept
# the literal value "foo bar"
#
# XXX This is going to be pretty slow because we keep
# splitting match_against. But it shouldn't come up
# too often.
return (whitespace_re.split(match_against) == markup)
else:
for item in markup:
if self._matches(item, match_against):
return True
return False
if match_against is True:
# True matches any non-None value.
return markup is not None
if isinstance(match_against, collections.Callable):
return match_against(markup)
# Custom callables take the tag as an argument, but all
# other ways of matching match the tag name as a string.
if isinstance(markup, Tag):
markup = markup.name
# Ensure that `markup` is either a Unicode string, or None.
markup = self._normalize_search_value(markup)
if markup is None:
# None matches None, False, an empty string, an empty list, and so on.
return not match_against
if isinstance(match_against, unicode):
# Exact string match
return markup == match_against
if hasattr(match_against, 'match'):
# Regexp match
return match_against.search(markup)
if hasattr(match_against, '__iter__'):
# The markup must be an exact match against something
# in the iterable.
return markup in match_against
class ResultSet(list):
"""A ResultSet is just a list that keeps track of the SoupStrainer
that created it."""
def __init__(self, source):
list.__init__(self)
self.source = source
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/testing.py
"""Helper classes for tests."""
import copy
import functools
import unittest
from unittest import TestCase
from bs4 import BeautifulSoup
from bs4.element import (
CharsetMetaAttributeValue,
Comment,
ContentMetaAttributeValue,
Doctype,
SoupStrainer,
)
from bs4.builder import HTMLParserTreeBuilder
default_builder = HTMLParserTreeBuilder
class SoupTest(unittest.TestCase):
@property
def default_builder(self):
return default_builder()
def soup(self, markup, **kwargs):
"""Build a Beautiful Soup object from markup."""
builder = kwargs.pop('builder', self.default_builder)
return BeautifulSoup(markup, builder=builder, **kwargs)
def document_for(self, markup):
"""Turn an HTML fragment into a document.
The details depend on the builder.
"""
return self.default_builder.test_fragment_to_document(markup)
def assertSoupEquals(self, to_parse, compare_parsed_to=None):
builder = self.default_builder
obj = BeautifulSoup(to_parse, builder=builder)
if compare_parsed_to is None:
compare_parsed_to = to_parse
self.assertEqual(obj.decode(), self.document_for(compare_parsed_to))
class HTMLTreeBuilderSmokeTest(object):
"""A basic test of a treebuilder's competence.
Any HTML treebuilder, present or future, should be able to pass
these tests. With invalid markup, there's room for interpretation,
and different parsers can handle it differently. But with the
markup in these tests, there's not much room for interpretation.
"""
def assertDoctypeHandled(self, doctype_fragment):
"""Assert that a given doctype string is handled correctly."""
doctype_str, soup = self._document_with_doctype(doctype_fragment)
# Make sure a Doctype object was created.
doctype = soup.contents[0]
self.assertEqual(doctype.__class__, Doctype)
self.assertEqual(doctype, doctype_fragment)
self.assertEqual(str(soup)[:len(doctype_str)], doctype_str)
# Make sure that the doctype was correctly associated with the
# parse tree and that the rest of the document parsed.
self.assertEqual(soup.p.contents[0], 'foo')
def _document_with_doctype(self, doctype_fragment):
"""Generate and parse a document with the given doctype."""
doctype = '<!DOCTYPE %s>' % doctype_fragment
markup = doctype + '\n<p>foo</p>'
soup = self.soup(markup)
return doctype, soup
def test_normal_doctypes(self):
"""Make sure normal, everyday HTML doctypes are handled correctly."""
self.assertDoctypeHandled("html")
self.assertDoctypeHandled(
'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"')
def test_public_doctype_with_url(self):
doctype = 'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"'
self.assertDoctypeHandled(doctype)
def test_system_doctype(self):
self.assertDoctypeHandled('foo SYSTEM "http://www.example.com/"')
def test_namespaced_system_doctype(self):
# We can handle a namespaced doctype with a system ID.
self.assertDoctypeHandled('xsl:stylesheet SYSTEM "htmlent.dtd"')
def test_namespaced_public_doctype(self):
# Test a namespaced doctype with a public id.
self.assertDoctypeHandled('xsl:stylesheet PUBLIC "htmlent.dtd"')
def test_real_xhtml_document(self):
"""A real XHTML document should come out more or less the same as it went in."""
markup = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello.</title></head><body>Goodbye.</body></html>"""
soup = self.soup(markup)
self.assertEqual(
soup.encode("utf-8").replace(b"\n", b""),
markup.replace(b"\n", b""))
def test_deepcopy(self):
"""Make sure you can copy the tree builder.
This is important because the builder is part of a
BeautifulSoup object, and we want to be able to copy that.
"""
copy.deepcopy(self.default_builder)
def test_p_tag_is_never_empty_element(self):
"""A <p> tag is never designated as an empty-element tag.
Even if the markup shows it as an empty-element tag, it
shouldn't be presented that way.
"""
soup = self.soup("<p/>")
self.assertFalse(soup.p.is_empty_element)
self.assertEqual(str(soup.p), "<p></p>")
def test_unclosed_tags_get_closed(self):
"""A tag that's not closed by the end of the document should be closed.
This applies to all tags except empty-element tags.
"""
self.assertSoupEquals("<p>", "<p></p>")
self.assertSoupEquals("<b>", "<b></b>")
self.assertSoupEquals("<br>", "<br/>")
def test_br_is_always_empty_element_tag(self):
"""A <br> tag is designated as an empty-element tag.
Some parsers treat <br></br> as one <br/> tag, some parsers as
two tags, but it should always be an empty-element tag.
"""
soup = self.soup("<br></br>")
self.assertTrue(soup.br.is_empty_element)
self.assertEqual(str(soup.br), "<br/>")
def test_nested_formatting_elements(self):
self.assertSoupEquals("<em><em></em></em>")
def test_comment(self):
# Comments are represented as Comment objects.
markup = "<p>foo<!--foobar-->baz</p>"
self.assertSoupEquals(markup)
soup = self.soup(markup)
comment = soup.find(text="foobar")
self.assertEqual(comment.__class__, Comment)
def test_preserved_whitespace_in_pre_and_textarea(self):
"""Whitespace must be preserved in <pre> and <textarea> tags."""
self.assertSoupEquals("<pre>   </pre>")
self.assertSoupEquals("<textarea> woo  </textarea>")
def test_nested_inline_elements(self):
"""Inline elements can be nested indefinitely."""
b_tag = "<b>Inside a B tag</b>"
self.assertSoupEquals(b_tag)
nested_b_tag = "<p>A <i>nested <b>tag</b></i></p>"
self.assertSoupEquals(nested_b_tag)
double_nested_b_tag = "<p>A <a>doubly <i>nested <b>tag</b></i></a></p>"
self.assertSoupEquals(double_nested_b_tag)
def test_nested_block_level_elements(self):
"""Block elements can be nested."""
soup = self.soup('<blockquote><p><b>Foo</b></p></blockquote>')
blockquote = soup.blockquote
self.assertEqual(blockquote.p.b.string, 'Foo')
self.assertEqual(blockquote.b.string, 'Foo')
def test_correctly_nested_tables(self):
"""One table can go inside another one."""
markup = ('<table id="1">'
'<tr>'
"<td>Here's another table:"
'<table id="2">'
'<tr><td>foo</td></tr>'
'</table></td>')
self.assertSoupEquals(
markup,
'<table id="1"><tr><td>Here\'s another table:'
'<table id="2"><tr><td>foo</td></tr></table>'
'</td></tr></table>')
self.assertSoupEquals(
"<table><thead><tr><td>Foo</td></tr></thead>"
"<tbody><tr><td>Bar</td></tr></tbody>"
"<tfoot><tr><td>Baz</td></tr></tfoot></table>")
def test_angle_brackets_in_attribute_values_are_escaped(self):
self.assertSoupEquals('<a b="<a>"></a>', '<a b="&lt;a&gt;"></a>')
def test_entities_in_attributes_converted_to_unicode(self):
expect = u'<p id="pi\N{LATIN SMALL LETTER N WITH TILDE}ata"></p>'
self.assertSoupEquals('<p id="pi&#241;ata"></p>', expect)
self.assertSoupEquals('<p id="pi&#xf1;ata"></p>', expect)
self.assertSoupEquals('<p id="pi&ntilde;ata"></p>', expect)
def test_entities_in_text_converted_to_unicode(self):
expect = u'<p>pi\N{LATIN SMALL LETTER N WITH TILDE}ata</p>'
self.assertSoupEquals("<p>pi&#241;ata</p>", expect)
self.assertSoupEquals("<p>pi&#xf1;ata</p>", expect)
self.assertSoupEquals("<p>pi&ntilde;ata</p>", expect)
def test_quot_entity_converted_to_quotation_mark(self):
self.assertSoupEquals("<p>I said &quot;good day!&quot;</p>",
'<p>I said "good day!"</p>')
def test_out_of_range_entity(self):
expect = u"\N{REPLACEMENT CHARACTER}"
self.assertSoupEquals("&#10000000000000;", expect)
self.assertSoupEquals("&#x10000000000000;", expect)
self.assertSoupEquals("&#1000000000;", expect)
def test_basic_namespaces(self):
"""Parsers don't need to *understand* namespaces, but at the
very least they should not choke on namespaces or lose
data."""
markup = b'<html xmlns="http://www.w3.org/1999/xhtml" xmlns:mathml="http://www.w3.org/1998/Math/MathML" xmlns:svg="http://www.w3.org/2000/svg"><head></head><body>4</body></html>'
soup = self.soup(markup)
self.assertEqual(markup, soup.encode())
html = soup.html
self.assertEqual('http://www.w3.org/1999/xhtml', soup.html['xmlns'])
self.assertEqual(
'http://www.w3.org/1998/Math/MathML', soup.html['xmlns:mathml'])
self.assertEqual(
'http://www.w3.org/2000/svg', soup.html['xmlns:svg'])
def test_multivalued_attribute_value_becomes_list(self):
markup = b'<a class="foo bar">'
soup = self.soup(markup)
self.assertEqual(['foo', 'bar'], soup.a['class'])
#
# Generally speaking, tests below this point are more tests of
# Beautiful Soup than tests of the tree builders. But parsers are
# weird, so we run these tests separately for every tree builder
# to detect any differences between them.
#
def test_soupstrainer(self):
"""Parsers should be able to work with SoupStrainers."""
strainer = SoupStrainer("b")
soup = self.soup("A <b>bold</b> <meta/> <i>statement</i>",
parse_only=strainer)
self.assertEqual(soup.decode(), "<b>bold</b>")
def test_single_quote_attribute_values_become_double_quotes(self):
self.assertSoupEquals("<foo attr='bar'></foo>",
'<foo attr="bar"></foo>')
def test_attribute_values_with_nested_quotes_are_left_alone(self):
text = """<foo attr='bar "brawls" happen'>a</foo>"""
self.assertSoupEquals(text)
def test_attribute_values_with_double_nested_quotes_get_quoted(self):
text = """<foo attr='bar "brawls" happen'>a</foo>"""
soup = self.soup(text)
soup.foo['attr'] = 'Brawls happen at "Bob\'s Bar"'
self.assertSoupEquals(
soup.foo.decode(),
"""<foo attr="Brawls happen at &quot;Bob\'s Bar&quot;">a</foo>""")
def test_ampersand_in_attribute_value_gets_escaped(self):
self.assertSoupEquals('<this is="really messed up & stuff"></this>',
'<this is="really messed up &amp; stuff"></this>')
self.assertSoupEquals(
'<a href="http://example.org?a=1&b=2;3">foo</a>',
'<a href="http://example.org?a=1&amp;b=2;3">foo</a>')
def test_escaped_ampersand_in_attribute_value_is_left_alone(self):
self.assertSoupEquals('<a href="http://example.org?a=1&amp;b=2;3"></a>')
def test_entities_in_strings_converted_during_parsing(self):
# Both XML and HTML entities are converted to Unicode characters
# during parsing.
text = "<p>&lt;&lt;sacr&eacute;&#32;bleu!&gt;&gt;</p>"
expected = u"<p><<sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!>></p>"
self.assertSoupEquals(text, expected)
def test_smart_quotes_converted_on_the_way_in(self):
# Microsoft smart quotes are converted to Unicode characters during
# parsing.
quote = b"<p>\x91Foo\x92</p>"
soup = self.soup(quote)
self.assertEqual(
soup.p.string,
u"\N{LEFT SINGLE QUOTATION MARK}Foo\N{RIGHT SINGLE QUOTATION MARK}")
def test_non_breaking_spaces_converted_on_the_way_in(self):
soup = self.soup("<a>&nbsp;&nbsp;</a>")
self.assertEqual(soup.a.string, u"\N{NO-BREAK SPACE}" * 2)
def test_entities_converted_on_the_way_out(self):
text = "<p>&lt;&lt;sacr&eacute;&#32;bleu!&gt;&gt;</p>"
expected = u"<p>&lt;&lt;sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</p>".encode("utf-8")
soup = self.soup(text)
self.assertEqual(soup.p.encode("utf-8"), expected)
def test_real_iso_latin_document(self):
# Smoke test of interrelated functionality, using an
# easy-to-understand document.
# Here it is in Unicode. Note that it claims to be in ISO-Latin-1.
unicode_html = u'<html><head><meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type"/></head><body><p>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</p></body></html>'
# That's because we're going to encode it into ISO-Latin-1, and use
# that to test.
iso_latin_html = unicode_html.encode("iso-8859-1")
# Parse the ISO-Latin-1 HTML.
soup = self.soup(iso_latin_html)
# Encode it to UTF-8.
result = soup.encode("utf-8")
# What do we expect the result to look like? Well, it would
# look like unicode_html, except that the META tag would say
# UTF-8 instead of ISO-Latin-1.
expected = unicode_html.replace("ISO-Latin-1", "utf-8")
# And, of course, it would be in UTF-8, not Unicode.
expected = expected.encode("utf-8")
# Ta-da!
self.assertEqual(result, expected)
def test_real_shift_jis_document(self):
# Smoke test to make sure the parser can handle a document in
# Shift-JIS encoding, without choking.
shift_jis_html = (
b'<html><head></head><body><pre>'
b'\x82\xb1\x82\xea\x82\xcdShift-JIS\x82\xc5\x83R\x81[\x83f'
b'\x83B\x83\x93\x83O\x82\xb3\x82\xea\x82\xbd\x93\xfa\x96{\x8c'
b'\xea\x82\xcc\x83t\x83@\x83C\x83\x8b\x82\xc5\x82\xb7\x81B'
b'</pre></body></html>')
unicode_html = shift_jis_html.decode("shift-jis")
soup = self.soup(unicode_html)
# Make sure the parse tree is correctly encoded to various
# encodings.
self.assertEqual(soup.encode("utf-8"), unicode_html.encode("utf-8"))
self.assertEqual(soup.encode("euc_jp"), unicode_html.encode("euc_jp"))
def test_real_hebrew_document(self):
# A real-world test to make sure we can convert ISO-8859-8 (a
# Hebrew encoding) to UTF-8.
hebrew_document = b'<html><head><title>Hebrew (ISO 8859-8) in Visual Directionality</title></head><body>\xed\xe5\xec\xf9</body></html>'
soup = self.soup(
hebrew_document, from_encoding="iso8859-8")
self.assertEqual(soup.original_encoding, 'iso8859-8')
self.assertEqual(
soup.encode('utf-8'),
hebrew_document.decode("iso8859-8").encode("utf-8"))
def test_meta_tag_reflects_current_encoding(self):
# Here's the <meta> tag saying that a document is
# encoded in Shift-JIS.
meta_tag = ('<meta content="text/html; charset=x-sjis" '
'http-equiv="Content-type"/>')
# Here's a document incorporating that meta tag.
shift_jis_html = (
'<html><head>\n%s\n'
'<meta http-equiv="Content-language" content="ja"/>'
'</head><body>Shift-JIS markup goes here.') % meta_tag
soup = self.soup(shift_jis_html)
# Parse the document, and the charset is seemingly unaffected.
parsed_meta = soup.find('meta', {'http-equiv': 'Content-type'})
content = parsed_meta['content']
self.assertEqual('text/html; charset=x-sjis', content)
# But that value is actually a ContentMetaAttributeValue object.
self.assertTrue(isinstance(content, ContentMetaAttributeValue))
# And it will take on a value that reflects its current
# encoding.
self.assertEqual('text/html; charset=utf8', content.encode("utf8"))
# For the rest of the story, see TestSubstitutions in
# test_tree.py.
def test_html5_style_meta_tag_reflects_current_encoding(self):
# Here's the <meta> tag saying that a document is
# encoded in Shift-JIS.
meta_tag = ('<meta id="encoding" charset="x-sjis"/>')
# Here's a document incorporating that meta tag.
shift_jis_html = (
'<html><head>\n%s\n'
'<meta http-equiv="Content-language" content="ja"/>'
'</head><body>Shift-JIS markup goes here.') % meta_tag
soup = self.soup(shift_jis_html)
# Parse the document, and the charset is seemingly unaffected.
parsed_meta = soup.find('meta', id="encoding")
charset = parsed_meta['charset']
self.assertEqual('x-sjis', charset)
# But that value is actually a CharsetMetaAttributeValue object.
self.assertTrue(isinstance(charset, CharsetMetaAttributeValue))
# And it will take on a value that reflects its current
# encoding.
self.assertEqual('utf8', charset.encode("utf8"))
def test_tag_with_no_attributes_can_have_attributes_added(self):
data = self.soup("<a>text</a>")
data.a['foo'] = 'bar'
self.assertEqual('<a foo="bar">text</a>', data.a.decode())
class XMLTreeBuilderSmokeTest(object):
def test_docstring_generated(self):
soup = self.soup("<root/>")
self.assertEqual(
soup.encode(), b'<?xml version="1.0" encoding="utf-8"?>\n<root/>')
def test_real_xhtml_document(self):
"""A real XHTML document should come out *exactly* the same as it went in."""
markup = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello.</title></head><body>Goodbye.</body></html>"""
soup = self.soup(markup)
self.assertEqual(
soup.encode("utf-8"), markup)
def test_docstring_includes_correct_encoding(self):
soup = self.soup("<root/>")
self.assertEqual(
soup.encode("latin1"),
b'<?xml version="1.0" encoding="latin1"?>\n<root/>')
def test_large_xml_document(self):
"""A large XML document should come out the same as it went in."""
markup = (b'<?xml version="1.0" encoding="utf-8"?>\n<root>'
+ b'0' * (2**12)
+ b'</root>')
soup = self.soup(markup)
self.assertEqual(soup.encode("utf-8"), markup)
def test_tags_are_empty_element_if_and_only_if_they_are_empty(self):
self.assertSoupEquals("<p>", "<p/>")
self.assertSoupEquals("<p>foo</p>")
def test_namespaces_are_preserved(self):
markup = '<root xmlns:a="http://example.com/" xmlns:b="http://example.net/"><a:foo>This tag is in the a namespace</a:foo><b:foo>This tag is in the b namespace</b:foo></root>'
soup = self.soup(markup)
root = soup.root
self.assertEqual("http://example.com/", root['xmlns:a'])
self.assertEqual("http://example.net/", root['xmlns:b'])
class HTML5TreeBuilderSmokeTest(HTMLTreeBuilderSmokeTest):
“””Smoke test for a tree builder that supports HTML5.”””
def test_real_xhtml_document(self):
# Since XHTML is not HTML5, HTML5 parsers are not tested to handle
# XHTML documents in any particular way.
pass
def test_html_tags_have_namespace(self):
markup = “”
soup = self.soup(markup)
self.assertEqual(“http://www.w3.org/1999/xhtml”, soup.a.namespace)
def test_svg_tags_have_namespace(self):
markup = ”
soup = self.soup(markup)
namespace = “http://www.w3.org/2000/svg”
self.assertEqual(namespace, soup.svg.namespace)
self.assertEqual(namespace, soup.circle.namespace)
def test_mathml_tags_have_namespace(self):
markup = ‘5’
soup = self.soup(markup)
namespace = ‘http://www.w3.org/1998/Math/MathML’
self.assertEqual(namespace, soup.math.namespace)
self.assertEqual(namespace, soup.msqrt.namespace)
def skipIf(condition, reason):
    def nothing(test, *args, **kwargs):
        return None

    def decorator(test_item):
        if condition:
            return nothing
        else:
            return test_item

    return decorator
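
# Illustrative sketch, not part of the original suite: when the condition
# is true, skipIf swaps the decorated test for the do-nothing function.
def _skipif_demo():
    @skipIf(True, "always skipped")
    def skipped(test):
        raise Exception("never runs")

    @skipIf(False, "never skipped")
    def kept(test):
        return "ran"

    assert skipped(None) is None  # replaced by `nothing`
    assert kept(None) == "ran"    # returned unchanged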
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/test_builder_registry.py

"""Tests of the builder registry."""

import unittest

from bs4 import BeautifulSoup
from bs4.builder import (
    builder_registry as registry,
    HTMLParserTreeBuilder,
    TreeBuilderRegistry,
)

try:
    from bs4.builder import HTML5TreeBuilder
    HTML5LIB_PRESENT = True
except ImportError:
    HTML5LIB_PRESENT = False

try:
    from bs4.builder import (
        LXMLTreeBuilderForXML,
        LXMLTreeBuilder,
    )
    LXML_PRESENT = True
except ImportError:
    LXML_PRESENT = False
class BuiltInRegistryTest(unittest.TestCase):
    """Test the built-in registry with the default builders registered."""

    def test_combination(self):
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('fast', 'html'),
                             LXMLTreeBuilder)
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('permissive', 'xml'),
                             LXMLTreeBuilderForXML)
        self.assertEqual(registry.lookup('strict', 'html'),
                         HTMLParserTreeBuilder)
        if HTML5LIB_PRESENT:
            self.assertEqual(registry.lookup('html5lib', 'html'),
                             HTML5TreeBuilder)

    def test_lookup_by_markup_type(self):
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('html'), LXMLTreeBuilder)
            self.assertEqual(registry.lookup('xml'), LXMLTreeBuilderForXML)
        else:
            self.assertEqual(registry.lookup('xml'), None)
            if HTML5LIB_PRESENT:
                self.assertEqual(registry.lookup('html'), HTML5TreeBuilder)
            else:
                self.assertEqual(registry.lookup('html'), HTMLParserTreeBuilder)

    def test_named_library(self):
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('lxml', 'xml'),
                             LXMLTreeBuilderForXML)
            self.assertEqual(registry.lookup('lxml', 'html'),
                             LXMLTreeBuilder)
        if HTML5LIB_PRESENT:
            self.assertEqual(registry.lookup('html5lib'),
                             HTML5TreeBuilder)
        self.assertEqual(registry.lookup('html.parser'),
                         HTMLParserTreeBuilder)

    def test_beautifulsoup_constructor_does_lookup(self):
        # You can pass in a string.
        BeautifulSoup("", features="html")
        # Or a list of strings.
        BeautifulSoup("", features=["html", "fast"])

        # You'll get an exception if BS can't find an appropriate
        # builder.
        self.assertRaises(ValueError, BeautifulSoup,
                          "", features="no-such-feature")
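
# Illustrative sketch, not part of the original suite: the registry is the
# machinery behind the `features` argument tested above. lookup() returns a
# tree builder class (or None), which BeautifulSoup will instantiate.
def _registry_lookup_demo():
    builder_class = registry.lookup('html')  # best available HTML builder
    assert builder_class is not None         # html.parser is always registered
    soup = BeautifulSoup("<p>hello</p>", builder=builder_class())
    return soup.p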
class RegistryTest(unittest.TestCase):
    """Test the TreeBuilderRegistry class in general."""

    def setUp(self):
        self.registry = TreeBuilderRegistry()

    def builder_for_features(self, *feature_list):
        cls = type('Builder_' + '_'.join(feature_list),
                   (object,), {'features': feature_list})
        self.registry.register(cls)
        return cls

    def test_register_with_no_features(self):
        builder = self.builder_for_features()

        # Since the builder advertises no features, you can't find it
        # by looking up features.
        self.assertEqual(self.registry.lookup('foo'), None)

        # But you can find it by doing a lookup with no features, if
        # this happens to be the only registered builder.
        self.assertEqual(self.registry.lookup(), builder)

    def test_register_with_features_makes_lookup_succeed(self):
        builder = self.builder_for_features('foo', 'bar')
        self.assertEqual(self.registry.lookup('foo'), builder)
        self.assertEqual(self.registry.lookup('bar'), builder)

    def test_lookup_fails_when_no_builder_implements_feature(self):
        builder = self.builder_for_features('foo', 'bar')
        self.assertEqual(self.registry.lookup('baz'), None)

    def test_lookup_gets_most_recent_registration_when_no_feature_specified(self):
        builder1 = self.builder_for_features('foo')
        builder2 = self.builder_for_features('bar')
        self.assertEqual(self.registry.lookup(), builder2)

    def test_lookup_fails_when_no_tree_builders_registered(self):
        self.assertEqual(self.registry.lookup(), None)

    def test_lookup_gets_most_recent_builder_supporting_all_features(self):
        has_one = self.builder_for_features('foo')
        has_the_other = self.builder_for_features('bar')
        has_both_early = self.builder_for_features('foo', 'bar', 'baz')
        has_both_late = self.builder_for_features('foo', 'bar', 'quux')
        lacks_one = self.builder_for_features('bar')
        has_the_other = self.builder_for_features('foo')

        # There are two builders featuring 'foo' and 'bar', but
        # the one that also features 'quux' was registered later.
        self.assertEqual(self.registry.lookup('foo', 'bar'),
                         has_both_late)

        # There is only one builder featuring 'foo', 'bar', and 'baz'.
        self.assertEqual(self.registry.lookup('foo', 'bar', 'baz'),
                         has_both_early)

    def test_lookup_fails_when_cannot_reconcile_requested_features(self):
        builder1 = self.builder_for_features('foo', 'bar')
        builder2 = self.builder_for_features('foo', 'baz')
        self.assertEqual(self.registry.lookup('bar', 'baz'), None)
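
# Illustrative sketch, not part of the original suite: registering a
# hypothetical builder on a fresh registry works the same way as
# builder_for_features() above. Features are plain strings; on a tie,
# the most recently registered builder wins.
def _custom_registry_demo():
    my_registry = TreeBuilderRegistry()

    class FakeBuilder(object):
        features = ['fake', 'html']

    my_registry.register(FakeBuilder)
    assert my_registry.lookup('fake') is FakeBuilder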
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/test_docs.py

"Test harness for doctests."

# pylint: disable-msg=E0611,W0142

__metaclass__ = type
__all__ = [
    'additional_tests',
]

import atexit
import doctest
import os
#from pkg_resources import (
#    resource_filename, resource_exists, resource_listdir, cleanup_resources)
import unittest

DOCTEST_FLAGS = (
    doctest.ELLIPSIS |
    doctest.NORMALIZE_WHITESPACE |
    doctest.REPORT_NDIFF)


# def additional_tests():
#     "Run the doc tests (README.txt and docs/*, if any exist)"
#     doctest_files = [
#         os.path.abspath(resource_filename('bs4', 'README.txt'))]
#     if resource_exists('bs4', 'docs'):
#         for name in resource_listdir('bs4', 'docs'):
#             if name.endswith('.txt'):
#                 doctest_files.append(
#                     os.path.abspath(
#                         resource_filename('bs4', 'docs/%s' % name)))
#     kwargs = dict(module_relative=False, optionflags=DOCTEST_FLAGS)
#     atexit.register(cleanup_resources)
#     return unittest.TestSuite((
#         doctest.DocFileSuite(*doctest_files, **kwargs)))
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/test_html5lib.py

"""Tests to ensure that the html5lib tree builder generates good trees."""

import warnings

try:
    from bs4.builder import HTML5TreeBuilder
    HTML5LIB_PRESENT = True
except ImportError, e:
    HTML5LIB_PRESENT = False

from bs4.element import SoupStrainer
from bs4.testing import (
    HTML5TreeBuilderSmokeTest,
    SoupTest,
    skipIf,
)

@skipIf(
    not HTML5LIB_PRESENT,
    "html5lib seems not to be present, not testing its tree builder.")
class HTML5LibBuilderSmokeTest(SoupTest, HTML5TreeBuilderSmokeTest):
    """See ``HTML5TreeBuilderSmokeTest``."""

    @property
    def default_builder(self):
        return HTML5TreeBuilder()

    def test_soupstrainer(self):
        # The html5lib tree builder does not support SoupStrainers.
        strainer = SoupStrainer("b")
        markup = "<p>A <b>bold</b> statement.</p>"
        with warnings.catch_warnings(record=True) as w:
            soup = self.soup(markup, parse_only=strainer)
        self.assertEqual(
            soup.decode(), self.document_for(markup))

        self.assertTrue(
            "the html5lib tree builder doesn't support parse_only" in
            str(w[0].message))

    def test_correctly_nested_tables(self):
        """html5lib inserts <tbody> tags where other parsers don't."""
        markup = ('<table id="1">'
                  '<tr>'
                  "<td>Here's another table:"
                  '<table id="2">'
                  '<tr><td>foo</td></tr>'
                  '</table></td>')

        self.assertSoupEquals(
            markup,
            '<table id="1"><tbody><tr><td>Here\'s another table:'
            '<table id="2"><tbody><tr><td>foo</td></tr></tbody></table>'
            '</td></tr></tbody></table>')

        self.assertSoupEquals(
            "<table><thead><tr><td>Foo</td></tr></thead>"
            "<tbody><tr><td>Bar</td></tr></tbody>"
            "<tfoot><tr><td>Baz</td></tr></tfoot></table>")
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/test_htmlparser.py

"""Tests to ensure that the html.parser tree builder generates good
trees."""

from bs4.testing import SoupTest, HTMLTreeBuilderSmokeTest
from bs4.builder import HTMLParserTreeBuilder

class HTMLParserTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest):

    @property
    def default_builder(self):
        return HTMLParserTreeBuilder()

    def test_namespaced_system_doctype(self):
        # html.parser can't handle namespaced doctypes, so skip this one.
        pass

    def test_namespaced_public_doctype(self):
        # html.parser can't handle namespaced doctypes, so skip this one.
        pass

beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/test_lxml.py

"""Tests to ensure that the lxml tree builder generates good trees."""
import re
import warnings

try:
    from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
    LXML_PRESENT = True
except ImportError, e:
    LXML_PRESENT = False

from bs4 import (
    BeautifulSoup,
    BeautifulStoneSoup,
)
from bs4.element import Comment, Doctype, SoupStrainer
from bs4.testing import skipIf
from bs4.tests import test_htmlparser
from bs4.testing import (
    HTMLTreeBuilderSmokeTest,
    XMLTreeBuilderSmokeTest,
    SoupTest,
    skipIf,
)

@skipIf(
    not LXML_PRESENT,
    "lxml seems not to be present, not testing its tree builder.")
class LXMLTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest):
    """See ``HTMLTreeBuilderSmokeTest``."""

    @property
    def default_builder(self):
        return LXMLTreeBuilder()

    def test_out_of_range_entity(self):
        self.assertSoupEquals(
            "<p>foo&#10000000000000;bar</p>", "<p>foobar</p>")
        self.assertSoupEquals(
            "<p>foo&#x10000000000000;bar</p>", "<p>foobar</p>")
        self.assertSoupEquals(
            "<p>foo&#1000000000;bar</p>", "<p>foobar</p>")

    def test_beautifulstonesoup_is_xml_parser(self):
        # Make sure that the deprecated BSS class uses an xml builder
        # if one is installed.
        with warnings.catch_warnings(record=False) as w:
            soup = BeautifulStoneSoup("<b />")
        self.assertEqual(u"<b/>", unicode(soup.b))

    def test_real_xhtml_document(self):
        """lxml strips the XML definition from an XHTML doc, which is fine."""
        markup = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello.</title></head>
<body>Goodbye.</body>
</html>"""
        soup = self.soup(markup)
        self.assertEqual(
            soup.encode("utf-8").replace(b"\n", b''),
            markup.replace(b'\n', b'').replace(
                b'<?xml version="1.0" encoding="utf-8"?>', b''))

@skipIf(
    not LXML_PRESENT,
    "lxml seems not to be present, not testing its XML tree builder.")
class LXMLXMLTreeBuilderSmokeTest(SoupTest, XMLTreeBuilderSmokeTest):
    """See ``HTMLTreeBuilderSmokeTest``."""

    @property
    def default_builder(self):
        return LXMLTreeBuilderForXML()
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/test_soup.py

# -*- coding: utf-8 -*-
"""Tests of Beautiful Soup as a whole."""

import unittest
from bs4 import (
    BeautifulSoup,
    BeautifulStoneSoup,
)
from bs4.element import (
    CharsetMetaAttributeValue,
    ContentMetaAttributeValue,
    SoupStrainer,
    NamespacedAttribute,
)
import bs4.dammit
from bs4.dammit import EntitySubstitution, UnicodeDammit
from bs4.testing import (
    SoupTest,
    skipIf,
)
import warnings

try:
    from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
    LXML_PRESENT = True
except ImportError, e:
    LXML_PRESENT = False
class TestDeprecatedConstructorArguments(SoupTest):

    def test_parseOnlyThese_renamed_to_parse_only(self):
        with warnings.catch_warnings(record=True) as w:
            soup = self.soup("<a><b></b></a>", parseOnlyThese=SoupStrainer("b"))
        msg = str(w[0].message)
        self.assertTrue("parseOnlyThese" in msg)
        self.assertTrue("parse_only" in msg)
        self.assertEqual(b"<b></b>", soup.encode())

    def test_fromEncoding_renamed_to_from_encoding(self):
        with warnings.catch_warnings(record=True) as w:
            utf8 = b"\xc3\xa9"
            soup = self.soup(utf8, fromEncoding="utf8")
        msg = str(w[0].message)
        self.assertTrue("fromEncoding" in msg)
        self.assertTrue("from_encoding" in msg)
        self.assertEqual("utf8", soup.original_encoding)

    def test_unrecognized_keyword_argument(self):
        self.assertRaises(
            TypeError, self.soup, "<a>", no_such_argument=True)

    @skipIf(
        not LXML_PRESENT,
        "lxml not present, not testing BeautifulStoneSoup.")
    def test_beautifulstonesoup(self):
        with warnings.catch_warnings(record=True) as w:
            soup = BeautifulStoneSoup("<markup>")
            self.assertTrue(isinstance(soup, BeautifulSoup))
            self.assertTrue(
                "BeautifulStoneSoup class is deprecated" in str(w[0].message))

class TestSelectiveParsing(SoupTest):

    def test_parse_with_soupstrainer(self):
        markup = "No<b>Yes</b><a>No<b>Yes <c>Yes</c></b>"
        strainer = SoupStrainer("b")
        soup = self.soup(markup, parse_only=strainer)
        self.assertEqual(soup.encode(), b"<b>Yes</b><b>Yes <c>Yes</c></b>")
class TestEntitySubstitution(unittest.TestCase):
    """Standalone tests of the EntitySubstitution class."""

    def setUp(self):
        self.sub = EntitySubstitution

    def test_simple_html_substitution(self):
        # Unicode characters corresponding to named HTML entities
        # are substituted, and no others.
        s = u"foo\u2200\N{SNOWMAN}\u00f5bar"
        self.assertEqual(self.sub.substitute_html(s),
                         u"foo&forall;\N{SNOWMAN}&otilde;bar")

    def test_smart_quote_substitution(self):
        # MS smart quotes are a common source of frustration, so we
        # give them a special test.
        quotes = b"\x91\x92foo\x93\x94"
        dammit = UnicodeDammit(quotes)
        self.assertEqual(self.sub.substitute_html(dammit.markup),
                         "&lsquo;&rsquo;foo&ldquo;&rdquo;")

    def test_xml_converstion_includes_no_quotes_if_make_quoted_attribute_is_false(self):
        s = 'Welcome to "my bar"'
        self.assertEqual(self.sub.substitute_xml(s, False), s)

    def test_xml_attribute_quoting_normally_uses_double_quotes(self):
        self.assertEqual(self.sub.substitute_xml("Welcome", True),
                         '"Welcome"')
        self.assertEqual(self.sub.substitute_xml("Bob's Bar", True),
                         '"Bob\'s Bar"')

    def test_xml_attribute_quoting_uses_single_quotes_when_value_contains_double_quotes(self):
        s = 'Welcome to "my bar"'
        self.assertEqual(self.sub.substitute_xml(s, True),
                         "'Welcome to \"my bar\"'")

    def test_xml_attribute_quoting_escapes_single_quotes_when_value_contains_both_single_and_double_quotes(self):
        s = 'Welcome to "Bob\'s Bar"'
        self.assertEqual(
            self.sub.substitute_xml(s, True),
            '"Welcome to &quot;Bob\'s Bar&quot;"')

    def test_xml_quotes_arent_escaped_when_value_is_not_being_quoted(self):
        quoted = 'Welcome to "Bob\'s Bar"'
        self.assertEqual(self.sub.substitute_xml(quoted), quoted)

    def test_xml_quoting_handles_angle_brackets(self):
        self.assertEqual(
            self.sub.substitute_xml("foo<bar>"),
            "foo&lt;bar&gt;")

    def test_xml_quoting_handles_ampersands(self):
        self.assertEqual(self.sub.substitute_xml("AT&T"), "AT&amp;T")

    def test_xml_quoting_ignores_ampersands_when_they_are_part_of_an_entity(self):
        self.assertEqual(
            self.sub.substitute_xml("&Aacute;T&T"),
            "&Aacute;T&amp;T")

    def test_quotes_not_html_substituted(self):
        """There's no need to do this except inside attribute values."""
        text = 'Bob\'s "bar"'
        self.assertEqual(self.sub.substitute_html(text), text)
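
# Illustrative sketch, not part of the original suite: the class methods
# exercised above are usable directly. substitute_xml() escapes only
# markup-significant characters; passing True also wraps the result in
# quotes suitable for an attribute value.
def _entity_substitution_demo():
    assert EntitySubstitution.substitute_xml("AT&T") == "AT&amp;T"
    # A value containing double quotes gets single-quoted.
    assert EntitySubstitution.substitute_xml('say "hi"', True) == '\'say "hi"\''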
class TestEncodingConversion(SoupTest):
    # Test Beautiful Soup's ability to decode and encode from various
    # encodings.

    def setUp(self):
        super(TestEncodingConversion, self).setUp()
        self.unicode_data = u"<html><head></head><body><foo>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</foo></body></html>"
        self.utf8_data = self.unicode_data.encode("utf-8")
        # Just so you know what it looks like.
        self.assertEqual(
            self.utf8_data,
            b"<html><head></head><body><foo>Sacr\xc3\xa9 bleu!</foo></body></html>")

    def test_ascii_in_unicode_out(self):
        # ASCII input is converted to Unicode. The original_encoding
        # attribute is set.
        ascii = b"<foo>a</foo>"
        soup_from_ascii = self.soup(ascii)
        unicode_output = soup_from_ascii.decode()
        self.assertTrue(isinstance(unicode_output, unicode))
        self.assertEqual(unicode_output, self.document_for(ascii.decode()))
        self.assertEqual(soup_from_ascii.original_encoding, "ascii")

    def test_unicode_in_unicode_out(self):
        # Unicode input is left alone. The original_encoding attribute
        # is not set.
        soup_from_unicode = self.soup(self.unicode_data)
        self.assertEqual(soup_from_unicode.decode(), self.unicode_data)
        self.assertEqual(soup_from_unicode.foo.string, u'Sacr\xe9 bleu!')
        self.assertEqual(soup_from_unicode.original_encoding, None)

    def test_utf8_in_unicode_out(self):
        # UTF-8 input is converted to Unicode. The original_encoding
        # attribute is set.
        soup_from_utf8 = self.soup(self.utf8_data)
        self.assertEqual(soup_from_utf8.decode(), self.unicode_data)
        self.assertEqual(soup_from_utf8.foo.string, u'Sacr\xe9 bleu!')

    def test_utf8_out(self):
        # The internal data structures can be encoded as UTF-8.
        soup_from_unicode = self.soup(self.unicode_data)
        self.assertEqual(soup_from_unicode.encode('utf-8'), self.utf8_data)
class TestUnicodeDammit(unittest.TestCase):
    """Standalone tests of Unicode, Dammit."""

    def test_smart_quotes_to_unicode(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup)
        self.assertEqual(
            dammit.unicode_markup, u"<foo>\u2018\u2019\u201c\u201d</foo>")

    def test_smart_quotes_to_xml_entities(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup, smart_quotes_to="xml")
        self.assertEqual(
            dammit.unicode_markup, "<foo>&#x2018;&#x2019;&#x201C;&#x201D;</foo>")

    def test_smart_quotes_to_html_entities(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup, smart_quotes_to="html")
        self.assertEqual(
            dammit.unicode_markup, "<foo>&lsquo;&rsquo;&ldquo;&rdquo;</foo>")

    def test_smart_quotes_to_ascii(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup, smart_quotes_to="ascii")
        self.assertEqual(
            dammit.unicode_markup, """<foo>''""</foo>""")

    def test_detect_utf8(self):
        utf8 = b"\xc3\xa9"
        dammit = UnicodeDammit(utf8)
        self.assertEqual(dammit.unicode_markup, u'\xe9')
        self.assertEqual(dammit.original_encoding, 'utf-8')

    def test_convert_hebrew(self):
        hebrew = b"\xed\xe5\xec\xf9"
        dammit = UnicodeDammit(hebrew, ["iso-8859-8"])
        self.assertEqual(dammit.original_encoding, 'iso-8859-8')
        self.assertEqual(dammit.unicode_markup, u'\u05dd\u05d5\u05dc\u05e9')

    def test_dont_see_smart_quotes_where_there_are_none(self):
        utf_8 = b"\343\202\261\343\203\274\343\202\277\343\202\244 Watch"
        dammit = UnicodeDammit(utf_8)
        self.assertEqual(dammit.original_encoding, 'utf-8')
        self.assertEqual(dammit.unicode_markup.encode("utf-8"), utf_8)

    def test_ignore_inappropriate_codecs(self):
        utf8_data = u"Räksmörgås".encode("utf-8")
        dammit = UnicodeDammit(utf8_data, ["iso-8859-8"])
        self.assertEqual(dammit.original_encoding, 'utf-8')

    def test_ignore_invalid_codecs(self):
        utf8_data = u"Räksmörgås".encode("utf-8")
        for bad_encoding in ['.utf8', '...', 'utF-16.!']:
            dammit = UnicodeDammit(utf8_data, [bad_encoding])
            self.assertEqual(dammit.original_encoding, 'utf-8')

    def test_detect_html5_style_meta_tag(self):
        for data in (
            b'<html><meta charset="euc-jp" /></html>',
            b"<html><meta charset='euc-jp' /></html>",
            b"<html><meta charset=euc-jp /></html>",
            b"<html><meta charset=euc-jp/></html>"):
            dammit = UnicodeDammit(data, is_html=True)
            self.assertEqual(
                "euc-jp", dammit.original_encoding)
    def test_last_ditch_entity_replacement(self):
        # This is a UTF-8 document that contains bytestrings
        # completely incompatible with UTF-8 (ie. encoded with some other
        # encoding).
        #
        # Since there is no consistent encoding for the document,
        # Unicode, Dammit will eventually encode the document as UTF-8
        # and encode the incompatible characters as REPLACEMENT
        # CHARACTER.
        #
        # If chardet is installed, it will detect that the document
        # can be converted into ISO-8859-1 without errors. This happens
        # to be the wrong encoding, but it is a consistent encoding, so the
        # code we're testing here won't run.
        #
        # So we temporarily disable chardet if it's present.
        doc = b"""\357\273\277<?xml version="1.0" encoding="UTF-8"?>
<html><b>\330\250\330\252\330\261</b>
<i>\310\322\321\220\312\321\355\344</i></html>"""
        chardet = bs4.dammit.chardet
        try:
            bs4.dammit.chardet = None
            with warnings.catch_warnings(record=True) as w:
                dammit = UnicodeDammit(doc)
                self.assertEqual(True, dammit.contains_replacement_characters)
                self.assertTrue(u"\ufffd" in dammit.unicode_markup)

                soup = BeautifulSoup(doc, "html.parser")
                self.assertTrue(soup.contains_replacement_characters)

                msg = w[0].message
                self.assertTrue(isinstance(msg, UnicodeWarning))
                self.assertTrue("Some characters could not be decoded" in str(msg))
        finally:
            bs4.dammit.chardet = chardet

    def test_sniffed_xml_encoding(self):
        # A document written in UTF-16LE will be converted by a different
        # code path that sniffs the byte order markers.
        data = b'\xff\xfe<\x00a\x00>\x00\xe1\x00\xe9\x00<\x00/\x00a\x00>\x00'
        dammit = UnicodeDammit(data)
        self.assertEqual(u"<a>áé</a>", dammit.unicode_markup)
        self.assertEqual("utf-16le", dammit.original_encoding)

    def test_detwingle(self):
        # Here's a UTF8 document.
        utf8 = (u"\N{SNOWMAN}" * 3).encode("utf8")

        # Here's a Windows-1252 document.
        windows_1252 = (
            u"\N{LEFT DOUBLE QUOTATION MARK}Hi, I like Windows!"
            u"\N{RIGHT DOUBLE QUOTATION MARK}").encode("windows_1252")

        # Through some unholy alchemy, they've been stuck together.
        doc = utf8 + windows_1252 + utf8

        # The document can't be turned into UTF-8:
        self.assertRaises(UnicodeDecodeError, doc.decode, "utf8")

        # Unicode, Dammit thinks the whole document is Windows-1252,
        # and decodes it into "☃☃☃“Hi, I like Windows!”☃☃☃"

        # But if we run it through fix_embedded_windows_1252, it's fixed:
        fixed = UnicodeDammit.detwingle(doc)
        self.assertEqual(
            u"☃☃☃“Hi, I like Windows!”☃☃☃", fixed.decode("utf8"))

    def test_detwingle_ignores_multibyte_characters(self):
        # Each of these characters has a UTF-8 representation ending
        # in \x93. \x93 is a smart quote if interpreted as
        # Windows-1252. But our code knows to skip over multibyte
        # UTF-8 characters, so they'll survive the process unscathed.
        for tricky_unicode_char in (
            u"\N{LATIN SMALL LIGATURE OE}", # 2-byte char '\xc5\x93'
            u"\N{LATIN SUBSCRIPT SMALL LETTER X}", # 3-byte char '\xe2\x82\x93'
            u"\xf0\x90\x90\x93", # This is a CJK character, not sure which one.
            ):
            input = tricky_unicode_char.encode("utf8")
            self.assertTrue(input.endswith(b'\x93'))
            output = UnicodeDammit.detwingle(input)
            self.assertEqual(output, input)
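
# Illustrative sketch, not part of the original suite: the typical
# UnicodeDammit workflow is to pass in raw bytes and read back the detected
# encoding and the Unicode version of the markup.
def _unicode_dammit_demo():
    dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
    assert dammit.original_encoding == 'utf-8'
    assert dammit.unicode_markup == u'Sacr\xe9 bleu!'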
class TestNamedspacedAttribute(SoupTest):

    def test_name_may_be_none(self):
        a = NamespacedAttribute("xmlns", None)
        self.assertEqual(a, "xmlns")

    def test_attribute_is_equivalent_to_colon_separated_string(self):
        a = NamespacedAttribute("a", "b")
        self.assertEqual("a:b", a)

    def test_attributes_are_equivalent_if_prefix_and_name_identical(self):
        a = NamespacedAttribute("a", "b", "c")
        b = NamespacedAttribute("a", "b", "c")
        self.assertEqual(a, b)

        # The actual namespace is not considered.
        c = NamespacedAttribute("a", "b", None)
        self.assertEqual(a, c)

        # But name and prefix are important.
        d = NamespacedAttribute("a", "z", "c")
        self.assertNotEqual(a, d)

        e = NamespacedAttribute("z", "b", "c")
        self.assertNotEqual(a, e)

class TestAttributeValueWithCharsetSubstitution(unittest.TestCase):

    def test_charset_meta_attribute_value(self):
        value = CharsetMetaAttributeValue("euc-jp")
        self.assertEqual("euc-jp", value)
        self.assertEqual("euc-jp", value.original_value)
        self.assertEqual("utf8", value.encode("utf8"))

    def test_content_meta_attribute_value(self):
        value = ContentMetaAttributeValue("text/html; charset=euc-jp")
        self.assertEqual("text/html; charset=euc-jp", value)
        self.assertEqual("text/html; charset=euc-jp", value.original_value)
        self.assertEqual("text/html; charset=utf8", value.encode("utf8"))
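
# Illustrative sketch, not part of the original suite: the wrapper classes
# compare equal to plain strings, but encode() rewrites the charset to
# whatever encoding the document is being serialized to.
def _charset_value_demo():
    value = ContentMetaAttributeValue("text/html; charset=euc-jp")
    assert value == "text/html; charset=euc-jp"
    assert value.encode("shift-jis") == "text/html; charset=shift-jis"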
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/test_tree.py

# -*- coding: utf-8 -*-
"""Tests for Beautiful Soup's tree traversal methods.

The tree traversal methods are the main advantage of using Beautiful
Soup over just using a parser.

Different parsers will build different Beautiful Soup trees given the
same markup, but all Beautiful Soup trees can be traversed with the
methods tested here.
"""

import copy
import pickle
import re
import warnings
from bs4 import BeautifulSoup
from bs4.builder import (
    builder_registry,
    HTMLParserTreeBuilder,
)
from bs4.element import (
    CData,
    Doctype,
    NavigableString,
    SoupStrainer,
    Tag,
)
from bs4.testing import (
    SoupTest,
    skipIf,
)

XML_BUILDER_PRESENT = (builder_registry.lookup("xml") is not None)
LXML_PRESENT = (builder_registry.lookup("lxml") is not None)
class TreeTest(SoupTest):

    def assertSelects(self, tags, should_match):
        """Make sure that the given tags have the correct text.

        This is used in tests that define a bunch of tags, each
        containing a single string, and then select certain strings by
        some mechanism.
        """
        self.assertEqual([tag.string for tag in tags], should_match)

    def assertSelectsIDs(self, tags, should_match):
        """Make sure that the given tags have the correct IDs.

        This is used in tests that define a bunch of tags, each
        containing a single string, and then select certain strings by
        some mechanism.
        """
        self.assertEqual([tag['id'] for tag in tags], should_match)

class TestFind(TreeTest):
    """Basic tests of the find() method.

    find() just calls find_all() with limit=1, so it's not tested all
    that thoroughly here.
    """

    def test_find_tag(self):
        soup = self.soup("<a>1</a><b>2</b><a>3</a><b>4</b>")
        self.assertEqual(soup.find("b").string, "2")

    def test_unicode_text_find(self):
        soup = self.soup(u'<h1>Räksmörgås</h1>')
        self.assertEqual(soup.find(text=u'Räksmörgås'), u'Räksmörgås')
class TestFindAll(TreeTest):
    """Basic tests of the find_all() method."""

    def test_find_all_text_nodes(self):
        """You can search the tree for text nodes."""
        soup = self.soup("<html>Foo<b>bar</b>\xbb</html>")
        # Exact match.
        self.assertEqual(soup.find_all(text="bar"), [u"bar"])
        # Match any of a number of strings.
        self.assertEqual(
            soup.find_all(text=["Foo", "bar"]), [u"Foo", u"bar"])
        # Match a regular expression.
        self.assertEqual(soup.find_all(text=re.compile('.*')),
                         [u"Foo", u"bar", u'\xbb'])
        # Match anything.
        self.assertEqual(soup.find_all(text=True),
                         [u"Foo", u"bar", u'\xbb'])

    def test_find_all_limit(self):
        """You can limit the number of items returned by find_all."""
        soup = self.soup("""<a>1</a>
                            <a>2</a>
                            <a>3</a>
                            <a>4</a>
                            <a>5</a>""")
        self.assertSelects(soup.find_all('a', limit=3), ["1", "2", "3"])
        self.assertSelects(soup.find_all('a', limit=1), ["1"])
        self.assertSelects(
            soup.find_all('a', limit=10), ["1", "2", "3", "4", "5"])

        # A limit of 0 means no limit.
        self.assertSelects(
            soup.find_all('a', limit=0), ["1", "2", "3", "4", "5"])

    def test_calling_a_tag_is_calling_findall(self):
        soup = self.soup("<a>1</a><b>2<a id='foo'>3</a></b>")
        self.assertSelects(soup('a', limit=1), ["1"])
        self.assertSelects(soup.b(id="foo"), ["3"])

    def test_find_all_with_self_referential_data_structure_does_not_cause_infinite_recursion(self):
        soup = self.soup("<a></a>")
        # Create a self-referential list.
        l = []
        l.append(l)

        # Without special code in _normalize_search_value, this would cause infinite
        # recursion.
        self.assertEqual([], soup.find_all(l))
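
# Illustrative sketch, not part of the original suite: find_all() combines
# a tag name, attribute filters, text= matchers, and limit=, all normalized
# through the same machinery the tests above poke at.
def _find_all_demo():
    soup = BeautifulSoup('<p id="a">One</p><p id="b">Two</p>', 'html.parser')
    assert [p['id'] for p in soup.find_all('p')] == ['a', 'b']
    assert soup.find_all('p', id='b')[0].string == u'Two'
    assert soup.find_all(text=re.compile('^T')) == [u'Two']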
class TestFindAllBasicNamespaces(TreeTest):

    def test_find_by_namespaced_name(self):
        soup = self.soup('<mathml:msqrt>4</mathml:msqrt><a svg:fill="red">')
        self.assertEqual("4", soup.find("mathml:msqrt").string)
        self.assertEqual("a", soup.find(attrs= { "svg:fill" : "red" }).name)
class TestFindAllByName(TreeTest):
    """Test ways of finding tags by tag name."""

    def setUp(self):
        super(TreeTest, self).setUp()
        self.tree = self.soup("""<a>First tag.</a>
                                 <b>Second tag.</b>
                                 <c>Third <a>Nested tag.</a> tag.</c>""")

    def test_find_all_by_tag_name(self):
        # Find all the <a> tags.
        self.assertSelects(
            self.tree.find_all('a'), ['First tag.', 'Nested tag.'])

    def test_find_all_by_name_and_text(self):
        self.assertSelects(
            self.tree.find_all('a', text='First tag.'), ['First tag.'])

        self.assertSelects(
            self.tree.find_all('a', text=True), ['First tag.', 'Nested tag.'])

        self.assertSelects(
            self.tree.find_all('a', text=re.compile("tag")),
            ['First tag.', 'Nested tag.'])

    def test_find_all_on_non_root_element(self):
        # You can call find_all on any node, not just the root.
        self.assertSelects(self.tree.c.find_all('a'), ['Nested tag.'])

    def test_calling_element_invokes_find_all(self):
        self.assertSelects(self.tree('a'), ['First tag.', 'Nested tag.'])

    def test_find_all_by_tag_strainer(self):
        self.assertSelects(
            self.tree.find_all(SoupStrainer('a')),
            ['First tag.', 'Nested tag.'])

    def test_find_all_by_tag_names(self):
        self.assertSelects(
            self.tree.find_all(['a', 'b']),
            ['First tag.', 'Second tag.', 'Nested tag.'])

    def test_find_all_by_tag_dict(self):
        self.assertSelects(
            self.tree.find_all({'a' : True, 'b' : True}),
            ['First tag.', 'Second tag.', 'Nested tag.'])

    def test_find_all_by_tag_re(self):
        self.assertSelects(
            self.tree.find_all(re.compile('^[ab]$')),
            ['First tag.', 'Second tag.', 'Nested tag.'])

    def test_find_all_with_tags_matching_method(self):
        # You can define an oracle method that determines whether
        # a tag matches the search.
        def id_matches_name(tag):
            return tag.name == tag.get('id')

        tree = self.soup("""<a id="a">Match 1.</a>
                            <a id="1">Does not match.</a>
                            <b id="b">Match 2.</b>""")

        self.assertSelects(
            tree.find_all(id_matches_name), ["Match 1.", "Match 2."])
class TestFindAllByAttribute(TreeTest):
def test_find_all_by_attribute_name(self):
# You can pass in keyword arguments to find_all to search by
# attribute.
tree = self.soup(“””
Matching a.
Non-matching Matching b.a.
“””)
self.assertSelects(tree.find_all(id=’first’),
[“Matching a.”, “Matching b.”])
    def test_find_all_by_utf8_attribute_value(self):
        peace = u"םולש".encode("utf8")
        data = u'<a title="םולש"></a>'.encode("utf8")
        soup = self.soup(data)
        self.assertEqual([soup.a], soup.find_all(title=peace))
        self.assertEqual([soup.a], soup.find_all(title=peace.decode("utf8")))
        self.assertEqual([soup.a], soup.find_all(title=[peace, "something else"]))
def test_find_all_by_attribute_dict(self):
# You can pass in a dictionary as the argument ‘attrs’. This
# lets you search for attributes like ‘name’ (a fixed argument
# to find_all) and ‘class’ (a reserved word in Python.)
tree = self.soup(“””
Name match.
Class match.
Non-match.
A tag called ‘name1’.
“””)
# This doesn’t do what you want.
self.assertSelects(tree.find_all(name=’name1′),
[“A tag called ‘name1’.”])
# This does what you want.
self.assertSelects(tree.find_all(attrs={‘name’ : ‘name1′}),
[“Name match.”])
# Passing class=’class2’ would cause a syntax error.
self.assertSelects(tree.find_all(attrs={‘class’ : ‘class2’}),
[“Class match.”])
def test_find_all_by_class(self):
# Passing in a string to ‘attrs’ will search the CSS class.
tree = self.soup(“””
Class 1.
Class 2.
Class 1.
Class 3 and 4.
“””)
self.assertSelects(tree.find_all(‘a’, ‘1’), [‘Class 1.’])
self.assertSelects(tree.find_all(attrs=’1′), [‘Class 1.’, ‘Class 1.’])
self.assertSelects(tree.find_all(‘c’, ‘3’), [‘Class 3 and 4.’])
self.assertSelects(tree.find_all(‘c’, ‘4’), [‘Class 3 and 4.’])
def test_find_by_class_when_multiple_classes_present(self):
tree = self.soup(“Found it”)
attrs = { ‘class’ : re.compile(“o”) }
f = tree.find_all(“gar”, attrs=attrs)
self.assertSelects(f, [“Found it”])
f = tree.find_all(“gar”, re.compile(“a”))
self.assertSelects(f, [“Found it”])
# Since the class is not the string “foo bar”, but the two
# strings “foo” and “bar”, this will not find anything.
attrs = { ‘class’ : re.compile(“o b”) }
f = tree.find_all(“gar”, attrs=attrs)
self.assertSelects(f, [])
    def test_find_all_with_non_dictionary_for_attrs_finds_by_class(self):
        soup = self.soup("<a class='bar'>Found it</a>")

        self.assertSelects(soup.find_all("a", re.compile("ba")), ["Found it"])

        def big_attribute_value(value):
            return len(value) > 3
        self.assertSelects(soup.find_all("a", big_attribute_value), [])

        def small_attribute_value(value):
            return len(value) <= 3
        self.assertSelects(soup.find_all("a", small_attribute_value), ["Found it"])

    def test_find_all_with_string_for_attrs_finds_multiple_classes(self):
        soup = self.soup('<a class="foo bar"></a><a class="foo"></a>')
        a, a2 = soup.find_all("a")
        self.assertEqual([a, a2], soup.find_all("a", "foo"))
        self.assertEqual([a], soup.find_all("a", "bar"))

        # If you specify the attribute as a string that contains a
        # space, only that specific value will be found.
        self.assertEqual([a], soup.find_all("a", "foo bar"))
        self.assertEqual([], soup.find_all("a", "bar foo"))
def test_find_all_by_attribute_soupstrainer(self):
tree = self.soup(“””
Match.
Non-match.”””)
strainer = SoupStrainer(attrs={‘id’ : ‘first’})
self.assertSelects(tree.find_all(strainer), [‘Match.’])
def test_find_all_with_missing_atribute(self):
# You can pass in None as the value of an attribute to find_all.
# This will match tags that do not have that attribute set.
tree = self.soup(“””ID present.
No ID present.
ID is empty.”””)
self.assertSelects(tree.find_all(‘a’, id=None), [“No ID present.”])
def test_find_all_with_defined_attribute(self):
# You can pass in None as the value of an attribute to find_all.
# This will match tags that have that attribute set to any value.
tree = self.soup(“””ID present.
No ID present.
ID is empty.”””)
self.assertSelects(
tree.find_all(id=True), [“ID present.”, “ID is empty.”])
def test_find_all_with_numeric_attribute(self):
# If you search for a number, it’s treated as a string.
tree = self.soup(“””Unquoted attribute.
Quoted attribute.”””)
expected = [“Unquoted attribute.”, “Quoted attribute.”]
self.assertSelects(tree.find_all(id=1), expected)
self.assertSelects(tree.find_all(id=”1″), expected)
def test_find_all_with_list_attribute_values(self):
# You can pass a list of attribute values instead of just one,
# and you’ll get tags that match any of the values.
tree = self.soup(“””1
2
3
No ID.”””)
self.assertSelects(tree.find_all(id=[“1”, “3”, “4”]),
[“1”, “3”])
def test_find_all_with_regular_expression_attribute_value(self):
# You can pass a regular expression as an attribute value, and
# you’ll get tags whose values for that attribute match the
# regular expression.
tree = self.soup(“””One a.
Two as.
Mixed as and bs.
One b.
No ID.”””)
self.assertSelects(tree.find_all(id=re.compile(“^a+$”)),
[“One a.”, “Two as.”])
def test_find_by_name_and_containing_string(self):
soup = self.soup(“foobarfoo”)
a = soup.a
self.assertEqual([a], soup.find_all(“a”, text=”foo”))
self.assertEqual([], soup.find_all(“a”, text=”bar”))
self.assertEqual([], soup.find_all(“a”, text=”bar”))
def test_find_by_name_and_containing_string_when_string_is_buried(self):
soup = self.soup(“foo
foo”)
self.assertEqual(soup.find_all(“a”), soup.find_all(“a”, text=”foo”))
def test_find_by_attribute_and_containing_string(self):
soup = self.soup(‘foofoo’)
a = soup.a
self.assertEqual([a], soup.find_all(id=2, text=”foo”))
self.assertEqual([], soup.find_all(id=1, text=”bar”))
class TestIndex(TreeTest):
“””Test Tag.index”””
def test_index(self):
tree = self.soup(“””
Identical
Not identical
Identical
Identical with child
Also not identical
Identical with child
“””)
div = tree.div
for i, element in enumerate(div.contents):
self.assertEqual(i, div.index(element))
self.assertRaises(ValueError, tree.index, 1)
class TestParentOperations(TreeTest):
“””Test navigation and searching through an element’s parents.”””
def setUp(self):
super(TestParentOperations, self).setUp()
self.tree = self.soup(”’
Start here
”’)
self.start = self.tree.b
def test_parent(self):
self.assertEqual(self.start.parent[‘id’], ‘bottom’)
self.assertEqual(self.start.parent.parent[‘id’], ‘middle’)
self.assertEqual(self.start.parent.parent.parent[‘id’], ‘top’)
def test_parent_of_top_tag_is_soup_object(self):
top_tag = self.tree.contents[0]
self.assertEqual(top_tag.parent, self.tree)
def test_soup_object_has_no_parent(self):
self.assertEqual(None, self.tree.parent)
def test_find_parents(self):
self.assertSelectsIDs(
self.start.find_parents(‘ul’), [‘bottom’, ‘middle’, ‘top’])
self.assertSelectsIDs(
self.start.find_parents(‘ul’, id=”middle”), [‘middle’])
def test_find_parent(self):
self.assertEqual(self.start.find_parent(‘ul’)[‘id’], ‘bottom’)
def test_parent_of_text_element(self):
text = self.tree.find(text=”Start here”)
self.assertEqual(text.parent.name, ‘b’)
def test_text_element_find_parent(self):
text = self.tree.find(text=”Start here”)
self.assertEqual(text.find_parent(‘ul’)[‘id’], ‘bottom’)
def test_parent_generator(self):
parents = [parent[‘id’] for parent in self.start.parents
if parent is not None and ‘id’ in parent.attrs]
self.assertEqual(parents, [‘bottom’, ‘middle’, ‘top’])
class ProximityTest(TreeTest):
def setUp(self):
super(TreeTest, self).setUp()
self.tree = self.soup(
‘OneTwoThree’)
class TestNextOperations(ProximityTest):
def setUp(self):
super(TestNextOperations, self).setUp()
self.start = self.tree.b
def test_next(self):
self.assertEqual(self.start.next_element, “One”)
self.assertEqual(self.start.next_element.next_element[‘id’], “2”)
def test_next_of_last_item_is_none(self):
last = self.tree.find(text=”Three”)
self.assertEqual(last.next_element, None)
def test_next_of_root_is_none(self):
# The document root is outside the next/previous chain.
self.assertEqual(self.tree.next_element, None)
def test_find_all_next(self):
self.assertSelects(self.start.find_all_next(‘b’), [“Two”, “Three”])
self.start.find_all_next(id=3)
self.assertSelects(self.start.find_all_next(id=3), [“Three”])
def test_find_next(self):
self.assertEqual(self.start.find_next(‘b’)[‘id’], ‘2’)
self.assertEqual(self.start.find_next(text=”Three”), “Three”)
def test_find_next_for_text_element(self):
text = self.tree.find(text=”One”)
self.assertEqual(text.find_next(“b”).string, “Two”)
self.assertSelects(text.find_all_next(“b”), [“Two”, “Three”])
def test_next_generator(self):
start = self.tree.find(text=”Two”)
successors = [node for node in start.next_elements]
# There are two successors: the final tag and its text contents.
tag, contents = successors
self.assertEqual(tag[‘id’], ‘3’)
self.assertEqual(contents, “Three”)
class TestPreviousOperations(ProximityTest):
def setUp(self):
super(TestPreviousOperations, self).setUp()
self.end = self.tree.find(text=”Three”)
def test_previous(self):
self.assertEqual(self.end.previous_element[‘id’], “3”)
self.assertEqual(self.end.previous_element.previous_element, “Two”)
def test_previous_of_first_item_is_none(self):
first = self.tree.find(‘html’)
self.assertEqual(first.previous_element, None)
def test_previous_of_root_is_none(self):
# The document root is outside the next/previous chain.
# XXX This is broken!
#self.assertEqual(self.tree.previous_element, None)
pass
def test_find_all_previous(self):
# The tag containing the “Three” node is the predecessor
# of the “Three” node itself, which is why “Three” shows up
# here.
self.assertSelects(
self.end.find_all_previous(‘b’), [“Three”, “Two”, “One”])
self.assertSelects(self.end.find_all_previous(id=1), [“One”])
def test_find_previous(self):
self.assertEqual(self.end.find_previous(‘b’)[‘id’], ‘3’)
self.assertEqual(self.end.find_previous(text=”One”), “One”)
def test_find_previous_for_text_element(self):
text = self.tree.find(text=”Three”)
self.assertEqual(text.find_previous(“b”).string, “Three”)
self.assertSelects(
text.find_all_previous(“b”), [“Three”, “Two”, “One”])
def test_previous_generator(self):
start = self.tree.find(text=”One”)
predecessors = [node for node in start.previous_elements]
# There are four predecessors: the tag containing “One”
# the tag, the tag, and the tag.
b, body, head, html = predecessors
self.assertEqual(b[‘id’], ‘1’)
self.assertEqual(body.name, “body”)
self.assertEqual(head.name, “head”)
self.assertEqual(html.name, “html”)
class SiblingTest(TreeTest):
def setUp(self):
super(SiblingTest, self).setUp()
markup = ”’
”’
# All that whitespace looks good but makes the tests more
# difficult. Get rid of it.
markup = re.compile(“\n\s*”).sub(“”, markup)
self.tree = self.soup(markup)
class TestNextSibling(SiblingTest):
def setUp(self):
super(TestNextSibling, self).setUp()
self.start = self.tree.find(id=”1″)
def test_next_sibling_of_root_is_none(self):
self.assertEqual(self.tree.next_sibling, None)
def test_next_sibling(self):
self.assertEqual(self.start.next_sibling[‘id’], ‘2’)
self.assertEqual(self.start.next_sibling.next_sibling[‘id’], ‘3’)
# Note the difference between next_sibling and next_element.
self.assertEqual(self.start.next_element[‘id’], ‘1.1’)
def test_next_sibling_may_not_exist(self):
self.assertEqual(self.tree.html.next_sibling, None)
nested_span = self.tree.find(id=”1.1″)
self.assertEqual(nested_span.next_sibling, None)
last_span = self.tree.find(id=”4″)
self.assertEqual(last_span.next_sibling, None)
def test_find_next_sibling(self):
self.assertEqual(self.start.find_next_sibling(‘span’)[‘id’], ‘2’)
def test_next_siblings(self):
self.assertSelectsIDs(self.start.find_next_siblings(“span”),
[‘2’, ‘3’, ‘4’])
self.assertSelectsIDs(self.start.find_next_siblings(id=’3′), [‘3’])
def test_next_sibling_for_text_element(self):
soup = self.soup(“Foobarbaz”)
start = soup.find(text=”Foo”)
self.assertEqual(start.next_sibling.name, ‘b’)
self.assertEqual(start.next_sibling.next_sibling, ‘baz’)
self.assertSelects(start.find_next_siblings(‘b’), [‘bar’])
self.assertEqual(start.find_next_sibling(text=”baz”), “baz”)
self.assertEqual(start.find_next_sibling(text=”nonesuch”), None)
class TestPreviousSibling(SiblingTest):
def setUp(self):
super(TestPreviousSibling, self).setUp()
self.end = self.tree.find(id=”4″)
def test_previous_sibling_of_root_is_none(self):
self.assertEqual(self.tree.previous_sibling, None)
def test_previous_sibling(self):
self.assertEqual(self.end.previous_sibling[‘id’], ‘3’)
self.assertEqual(self.end.previous_sibling.previous_sibling[‘id’], ‘2’)
# Note the difference between previous_sibling and previous_element.
self.assertEqual(self.end.previous_element[‘id’], ‘3.1’)
def test_previous_sibling_may_not_exist(self):
self.assertEqual(self.tree.html.previous_sibling, None)
nested_span = self.tree.find(id=”1.1″)
self.assertEqual(nested_span.previous_sibling, None)
first_span = self.tree.find(id=”1″)
self.assertEqual(first_span.previous_sibling, None)
def test_find_previous_sibling(self):
self.assertEqual(self.end.find_previous_sibling(‘span’)[‘id’], ‘3’)
def test_previous_siblings(self):
self.assertSelectsIDs(self.end.find_previous_siblings(“span”),
[‘3’, ‘2’, ‘1’])
self.assertSelectsIDs(self.end.find_previous_siblings(id=’1′), [‘1’])
def test_previous_sibling_for_text_element(self):
soup = self.soup(“Foobarbaz”)
start = soup.find(text=”baz”)
self.assertEqual(start.previous_sibling.name, ‘b’)
self.assertEqual(start.previous_sibling.previous_sibling, ‘Foo’)
self.assertSelects(start.find_previous_siblings(‘b’), [‘bar’])
self.assertEqual(start.find_previous_sibling(text=”Foo”), “Foo”)
self.assertEqual(start.find_previous_sibling(text=”nonesuch”), None)
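
# Illustrative sketch, not part of the original suite: next_element walks
# the parse order (tags and strings alike), while next_sibling skips to the
# next node at the same level of the tree.
def _navigation_demo():
    soup = BeautifulSoup('<a><b>One</b><c>Two</c></a>', 'html.parser')
    assert soup.b.next_element == u'One'    # first node parsed after <b>
    assert soup.b.next_sibling.name == 'c'  # same-level neighbor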
class TestTagCreation(SoupTest):
“””Test the ability to create new tags.”””
def test_new_tag(self):
soup = self.soup(“”)
new_tag = soup.new_tag(“foo”, bar=”baz”)
self.assertTrue(isinstance(new_tag, Tag))
self.assertEqual(“foo”, new_tag.name)
self.assertEqual(dict(bar=”baz”), new_tag.attrs)
self.assertEqual(None, new_tag.parent)
    def test_tag_inherits_self_closing_rules_from_builder(self):
        if XML_BUILDER_PRESENT:
            xml_soup = BeautifulSoup("", "xml")
            xml_br = xml_soup.new_tag("br")
            xml_p = xml_soup.new_tag("p")

            # Both the <br> and <p> tag are empty-element, just because
            # they have no contents.
            self.assertEqual(b"<br/>", xml_br.encode())
            self.assertEqual(b"<p/>", xml_p.encode())

        html_soup = BeautifulSoup("", "html")
        html_br = html_soup.new_tag("br")
        html_p = html_soup.new_tag("p")

        # The HTML builder uses HTML's rules about which tags are
        # empty-element tags, and the new tags reflect these rules.
        self.assertEqual(b"<br/>", html_br.encode())
        self.assertEqual(b"<p></p>", html_p.encode())
def test_new_string_creates_navigablestring(self):
soup = self.soup(“”)
s = soup.new_string(“foo”)
self.assertEqual(“foo”, s)
self.assertTrue(isinstance(s, NavigableString))
class TestTreeModification(SoupTest):
    def test_attribute_modification(self):
        soup = self.soup('<a id="1"></a>')
        soup.a['id'] = 2
        self.assertEqual(soup.decode(), self.document_for('<a id="2"></a>'))
        del(soup.a['id'])
        self.assertEqual(soup.decode(), self.document_for('<a></a>'))
        soup.a['id2'] = 'foo'
        self.assertEqual(soup.decode(), self.document_for('<a id2="foo"></a>'))
def test_new_tag_creation(self):
builder = builder_registry.lookup(‘html’)()
soup = self.soup(“”, builder=builder)
a = Tag(soup, builder, ‘a’)
ol = Tag(soup, builder, ‘ol’)
a[‘href’] = ‘http://foo.com/’
soup.body.insert(0, a)
soup.body.insert(1, ol)
self.assertEqual(
soup.body.encode(),
b’
‘)
    def test_append_to_contents_moves_tag(self):
        doc = """<p id="1">Don't leave me <b>here</b>.</p>
                <p id="2">Don\'t leave!</p>"""
        soup = self.soup(doc)
        second_para = soup.find(id='2')
        bold = soup.b

        # Move the <b> tag to the end of the second paragraph.
        soup.find(id='2').append(soup.b)

        # The <b> tag is now a child of the second paragraph.
        self.assertEqual(bold.parent, second_para)

        self.assertEqual(
            soup.decode(), self.document_for(
                '<p id="1">Don\'t leave me .</p>\n'
                '<p id="2">Don\'t leave!<b>here</b></p>'))
def test_replace_with_returns_thing_that_was_replaced(self):
text = “”
soup = self.soup(text)
a = soup.a
new_a = a.replace_with(soup.c)
self.assertEqual(a, new_a)
def test_unwrap_returns_thing_that_was_replaced(self):
text = “”
soup = self.soup(text)
a = soup.a
new_a = a.unwrap()
self.assertEqual(a, new_a)
    def test_replace_tag_with_itself(self):
        text = "<a><b></b><c>Foo<d></d></c></a><a><e></e></a>"
        soup = self.soup(text)
        c = soup.c
        soup.c.replace_with(c)
        self.assertEqual(soup.decode(), self.document_for(text))
def test_replace_tag_with_its_parent_raises_exception(self):
text = “”
soup = self.soup(text)
self.assertRaises(ValueError, soup.b.replace_with, soup.a)
def test_insert_tag_into_itself_raises_exception(self):
text = “”
soup = self.soup(text)
self.assertRaises(ValueError, soup.a.insert, 0, soup.a)
def test_replace_with_maintains_next_element_throughout(self):
soup = self.soup(‘
onethree
‘)
a = soup.a
b = a.contents[0]
# Make it so the tag has two text children.
a.insert(1, “two”)
# Now replace each one with the empty string.
left, right = a.contents
left.replaceWith(”)
right.replaceWith(”)
# The tag is still connected to the tree.
self.assertEqual(“three”, soup.b.string)
    def test_replace_final_node(self):
        soup = self.soup("<b>Argh!</b>")
        soup.find(text="Argh!").replace_with("Hooray!")
        new_text = soup.find(text="Hooray!")
        b = soup.b
        self.assertEqual(new_text.previous_element, b)
        self.assertEqual(new_text.parent, b)
        self.assertEqual(new_text.previous_element.next_element, new_text)
        self.assertEqual(new_text.next_element, None)

    def test_consecutive_text_nodes(self):
        # A builder should never create two consecutive text nodes,
        # but if you insert one next to another, Beautiful Soup will
        # handle it correctly.
        soup = self.soup("<a><b>Argh!</b><c></c></a>")
        soup.b.insert(1, "Hooray!")

        self.assertEqual(
            soup.decode(), self.document_for(
                "<a><b>Argh!Hooray!</b><c></c></a>"))

        new_text = soup.find(text="Hooray!")
        self.assertEqual(new_text.previous_element, "Argh!")
        self.assertEqual(new_text.previous_element.next_element, new_text)

        self.assertEqual(new_text.previous_sibling, "Argh!")
        self.assertEqual(new_text.previous_sibling.next_sibling, new_text)

        self.assertEqual(new_text.next_sibling, None)
        self.assertEqual(new_text.next_element, soup.c)
def test_insert_string(self):
soup = self.soup(“”)
soup.a.insert(0, “bar”)
soup.a.insert(0, “foo”)
# The string were added to the tag.
self.assertEqual([“foo”, “bar”], soup.a.contents)
# And they were converted to NavigableStrings.
self.assertEqual(soup.a.contents[0].next_element, “bar”)
    def test_insert_tag(self):
        builder = self.default_builder
        soup = self.soup(
            "<a><b>Find</b><c>lady!</c></a>", builder=builder)
        magic_tag = Tag(soup, builder, 'magictag')
        magic_tag.insert(0, "the")
        soup.a.insert(1, magic_tag)

        self.assertEqual(
            soup.decode(), self.document_for(
                "<a><b>Find</b><magictag>the</magictag><c>lady!</c></a>"))

        # Make sure all the relationships are hooked up correctly.
        b_tag = soup.b
        self.assertEqual(b_tag.next_sibling, magic_tag)
        self.assertEqual(magic_tag.previous_sibling, b_tag)

        find = b_tag.find(text="Find")
        self.assertEqual(find.next_element, magic_tag)
        self.assertEqual(magic_tag.previous_element, find)

        c_tag = soup.c
        self.assertEqual(magic_tag.next_sibling, c_tag)
        self.assertEqual(c_tag.previous_sibling, magic_tag)

        the = magic_tag.find(text="the")
        self.assertEqual(the.parent, magic_tag)
        self.assertEqual(the.next_element, c_tag)
        self.assertEqual(c_tag.previous_element, the)
    def test_append_child_thats_already_at_the_end(self):
        data = "<a><b></b></a>"
        soup = self.soup(data)
        soup.a.append(soup.b)
        self.assertEqual(data, soup.decode())

    def test_move_tag_to_beginning_of_parent(self):
        data = "<a><b></b><c></c><d></d></a>"
        soup = self.soup(data)
        soup.a.insert(0, soup.d)
        self.assertEqual("<a><d></d><b></b><c></c></a>", soup.decode())

    def test_insert_works_on_empty_element_tag(self):
        # This is a little strange, since most HTML parsers don't allow
        # markup like this to come through. But in general, we don't
        # know what the parser would or wouldn't have allowed, so
        # I'm letting this succeed for now.
        soup = self.soup("<br/>")
        soup.br.insert(1, "Contents")
        self.assertEqual(str(soup.br), "<br>Contents</br>")
    def test_insert_before(self):
        soup = self.soup("<a>foo</a><b>bar</b>")
        soup.b.insert_before("BAZ")
        soup.a.insert_before("QUUX")
        self.assertEqual(
            soup.decode(), self.document_for("QUUX<a>foo</a>BAZ<b>bar</b>"))

        soup.a.insert_before(soup.b)
        self.assertEqual(
            soup.decode(), self.document_for("QUUX<b>bar</b><a>foo</a>BAZ"))

    def test_insert_after(self):
        soup = self.soup("<a>foo</a><b>bar</b>")
        soup.b.insert_after("BAZ")
        soup.a.insert_after("QUUX")
        self.assertEqual(
            soup.decode(), self.document_for("<a>foo</a>QUUX<b>bar</b>BAZ"))
        soup.b.insert_after(soup.a)
        self.assertEqual(
            soup.decode(), self.document_for("QUUX<b>bar</b><a>foo</a>BAZ"))
def test_insert_after_raises_valueerror_if_after_has_no_meaning(self):
soup = self.soup(“”)
tag = soup.new_tag(“a”)
string = soup.new_string(“”)
self.assertRaises(ValueError, string.insert_after, tag)
self.assertRaises(ValueError, soup.insert_after, tag)
self.assertRaises(ValueError, tag.insert_after, tag)
def test_insert_before_raises_valueerror_if_before_has_no_meaning(self):
soup = self.soup(“”)
tag = soup.new_tag(“a”)
string = soup.new_string(“”)
self.assertRaises(ValueError, string.insert_before, tag)
self.assertRaises(ValueError, soup.insert_before, tag)
self.assertRaises(ValueError, tag.insert_before, tag)
    def test_replace_with(self):
        soup = self.soup(
            "<p>There's <b>no</b> business like <b>show</b> business</p>")
        no, show = soup.find_all('b')
        show.replace_with(no)
        self.assertEqual(
            soup.decode(),
            self.document_for(
                "<p>There's  business like <b>no</b> business</p>"))

        self.assertEqual(show.parent, None)
        self.assertEqual(no.parent, soup.p)
        self.assertEqual(no.next_element, "no")
        self.assertEqual(no.next_sibling, " business")
def test_replace_first_child(self):
data = “”
soup = self.soup(data)
soup.b.replace_with(soup.c)
self.assertEqual(“”, soup.decode())
def test_replace_last_child(self):
data = “”
soup = self.soup(data)
soup.c.replace_with(soup.b)
self.assertEqual(“”, soup.decode())
    def test_nested_tag_replace_with(self):
        soup = self.soup(
            """<a>We<b>reserve<c>the</c><d>right</d></b></a><e>to<f>refuse</f><g>service</g></e>""")

        # Replace the entire <b> tag and its contents ("reserve the
        # right") with the <f> tag ("refuse").
        remove_tag = soup.b
        move_tag = soup.f
        remove_tag.replace_with(move_tag)

        self.assertEqual(
            soup.decode(), self.document_for(
                "<a>We<f>refuse</f></a><e>to<g>service</g></e>"))

        # The <b> tag is now an orphan.
        self.assertEqual(remove_tag.parent, None)
        self.assertEqual(remove_tag.find(text="right").next_element, None)
        self.assertEqual(remove_tag.previous_element, None)
        self.assertEqual(remove_tag.next_sibling, None)
        self.assertEqual(remove_tag.previous_sibling, None)

        # The <f> tag is now connected to the <a> tag.
        self.assertEqual(move_tag.parent, soup.a)
        self.assertEqual(move_tag.previous_element, "We")
        self.assertEqual(move_tag.next_element.next_element, soup.e)
        self.assertEqual(move_tag.next_sibling, None)

        # The gap where the <b> tag used to be has been mended, and
        # the word "to" is now connected to the <g> tag.
        to_text = soup.find(text="to")
        g_tag = soup.g
        self.assertEqual(to_text.next_element, g_tag)
        self.assertEqual(to_text.next_sibling, g_tag)
        self.assertEqual(g_tag.previous_element, to_text)
        self.assertEqual(g_tag.previous_sibling, to_text)
def test_unwrap(self):
tree = self.soup(“””
Unneeded formatting is unneeded
“””)
tree.em.unwrap()
self.assertEqual(tree.em, None)
self.assertEqual(tree.p.text, “Unneeded formatting is unneeded”)
    def test_wrap(self):
        soup = self.soup("I wish I was bold.")
        value = soup.string.wrap(soup.new_tag("b"))
        self.assertEqual(value.decode(), "<b>I wish I was bold.</b>")
        self.assertEqual(
            soup.decode(), self.document_for("<b>I wish I was bold.</b>"))

    def test_wrap_extracts_tag_from_elsewhere(self):
        soup = self.soup("<b></b>I wish I was bold.")
        soup.b.next_sibling.wrap(soup.b)
        self.assertEqual(
            soup.decode(), self.document_for("<b>I wish I was bold.</b>"))

    def test_wrap_puts_new_contents_at_the_end(self):
        soup = self.soup("<b>I like being bold.</b>I wish I was bold.")
        soup.b.next_sibling.wrap(soup.b)
        self.assertEqual(2, len(soup.b.contents))
        self.assertEqual(
            soup.decode(), self.document_for(
                "<b>I like being bold.I wish I was bold.</b>"))
    def test_extract(self):
        soup = self.soup(
            '<html><body>Some content. <div id="nav">Nav crap</div> More content.</body></html>')

        self.assertEqual(len(soup.body.contents), 3)
        extracted = soup.find(id="nav").extract()

        self.assertEqual(
            soup.decode(), "<html><body>Some content.  More content.</body></html>")
        self.assertEqual(extracted.decode(), '<div id="nav">Nav crap</div>')

        # The extracted tag is now an orphan.
        self.assertEqual(len(soup.body.contents), 2)
        self.assertEqual(extracted.parent, None)
        self.assertEqual(extracted.previous_element, None)
        self.assertEqual(extracted.next_element.next_element, None)

        # The gap where the extracted tag used to be has been mended.
        content_1 = soup.find(text="Some content. ")
        content_2 = soup.find(text=" More content.")
        self.assertEqual(content_1.next_element, content_2)
        self.assertEqual(content_1.next_sibling, content_2)
        self.assertEqual(content_2.previous_element, content_1)
        self.assertEqual(content_2.previous_sibling, content_1)
def test_extract_distinguishes_between_identical_strings(self):
soup = self.soup(“foobar”)
foo_1 = soup.a.string
bar_1 = soup.b.string
foo_2 = soup.new_string(“foo”)
bar_2 = soup.new_string(“bar”)
soup.a.append(foo_2)
soup.b.append(bar_2)
# Now there are two identical strings in the tag, and two
# in the tag. Let’s remove the first “foo” and the second
# “bar”.
foo_1.extract()
bar_2.extract()
self.assertEqual(foo_2, soup.a.string)
self.assertEqual(bar_2, soup.b.string)
def test_clear(self):
“””Tag.clear()”””
soup = self.soup(”
String Italicized and another
“)
# clear using extract()
a = soup.a
soup.p.clear()
self.assertEqual(len(soup.p.contents), 0)
self.assertTrue(hasattr(a, “contents”))
# clear using decompose()
em = a.em
a.clear(decompose=True)
self.assertFalse(hasattr(em, “contents”))
def test_string_set(self):
“””Tag.string = ‘string'”””
soup = self.soup(” “)
soup.a.string = “foo”
self.assertEqual(soup.a.contents, [“foo”])
soup.b.string = “bar”
self.assertEqual(soup.b.contents, [“bar”])
def test_string_set_does_not_affect_original_string(self):
soup = self.soup(“foobar”)
soup.b.string = soup.c.string
self.assertEqual(soup.a.encode(), b”
barbar”)
def test_set_string_preserves_class_of_string(self):
soup = self.soup(“”)
cdata = CData(“foo”)
soup.a.string = cdata
self.assertTrue(isinstance(soup.a.string, CData))
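
# Illustrative sketch, not part of the original suite: the modification
# methods tested above return the node that was moved or removed, so the
# result can be inspected after the tree is changed.
def _modification_demo():
    soup = BeautifulSoup('<p><b>bad</b>good</p>', 'html.parser')
    removed = soup.b.extract()
    assert removed.string == u'bad'
    assert soup.p.decode() == u'<p>good</p>'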
class TestElementObjects(SoupTest):
“””Test various features of element objects.”””
def test_len(self):
“””The length of an element is its number of children.”””
soup = self.soup(“123”)
# The BeautifulSoup object itself contains one element: the
# tag.
self.assertEqual(len(soup.contents), 1)
self.assertEqual(len(soup), 1)
# The tag contains three elements: the text node “1”, the
# tag, and the text node “3”.
self.assertEqual(len(soup.top), 3)
self.assertEqual(len(soup.top.contents), 3)
def test_member_access_invokes_find(self):
“””Accessing a Python member .foo invokes find(‘foo’)”””
soup = self.soup(”)
self.assertEqual(soup.b, soup.find(‘b’))
self.assertEqual(soup.b.i, soup.find(‘b’).find(‘i’))
self.assertEqual(soup.a, None)
def test_deprecated_member_access(self):
soup = self.soup(”)
with warnings.catch_warnings(record=True) as w:
tag = soup.bTag
self.assertEqual(soup.b, tag)
self.assertEqual(
‘.bTag is deprecated, use .find(“b”) instead.’,
str(w[0].message))
    def test_has_attr(self):
        """has_attr() checks for the presence of an attribute.

        Please note: has_attr() is different from __in__. has_attr()
        checks the tag's attributes and __in__ checks the tag's
        children.
        """
        soup = self.soup("<foo attr='bar'>")
        self.assertTrue(soup.foo.has_attr('attr'))
        self.assertFalse(soup.foo.has_attr('attr2'))
def test_attributes_come_out_in_alphabetical_order(self):
markup = ”
self.assertSoupEquals(markup, ”)
def test_string(self):
# A tag that contains only a text node makes that node
# available as .string.
soup = self.soup(“foo”)
self.assertEqual(soup.b.string, ‘foo’)
    def test_empty_tag_has_no_string(self):
        # A tag with no children has no .string.
        soup = self.soup("<b></b>")
        self.assertEqual(soup.b.string, None)
def test_tag_with_multiple_children_has_no_string(self):
# A tag with no children has no .string.
soup = self.soup(“foo”)
self.assertEqual(soup.b.string, None)
soup = self.soup(“foobar”)
self.assertEqual(soup.b.string, None)
# Even if all the children are strings, due to trickery,
# it won’t work–but this would be a good optimization.
soup = self.soup(“foo”)
soup.a.insert(1, “bar”)
self.assertEqual(soup.a.string, None)
def test_tag_with_recursive_string_has_string(self):
# A tag with a single child which has a .string inherits that
# .string.
soup = self.soup(“foo”)
self.assertEqual(soup.a.string, “foo”)
self.assertEqual(soup.string, “foo”)
def test_lack_of_string(self):
“””Only a tag containing a single text node has a .string.”””
soup = self.soup(“feo”)
self.assertFalse(soup.b.string)
soup = self.soup(“”)
self.assertFalse(soup.b.string)
def test_all_text(self):
“””Tag.text and Tag.get_text(sep=u””) -> all child text, concatenated”””
soup = self.soup(“ar t “)
self.assertEqual(soup.a.text, “ar t “)
self.assertEqual(soup.a.get_text(strip=True), “art”)
self.assertEqual(soup.a.get_text(“,”), “a,r, , t “)
self.assertEqual(soup.a.get_text(“,”, strip=True), “a,r,t”)
class TestCDAtaListAttributes(SoupTest):

    """Testing cdata-list attributes like 'class'.
    """
    def test_single_value_becomes_list(self):
        soup = self.soup("<a class='foo'>")
        self.assertEqual(["foo"], soup.a['class'])

    def test_multiple_values_becomes_list(self):
        soup = self.soup("<a class='foo bar'>")
        self.assertEqual(["foo", "bar"], soup.a['class'])

    def test_multiple_values_separated_by_weird_whitespace(self):
        soup = self.soup("<a class='foo\tbar\nbaz'>")
        self.assertEqual(["foo", "bar", "baz"], soup.a['class'])

    def test_attributes_joined_into_string_on_output(self):
        soup = self.soup("<a class='foo\tbar'>")
        self.assertEqual(b'<a class="foo bar"></a>', soup.a.encode())

    def test_accept_charset(self):
        soup = self.soup('<form accept-charset="ISO-8859-1 UTF-8">')
        self.assertEqual(['ISO-8859-1', 'UTF-8'], soup.form['accept-charset'])

    def test_cdata_attribute_applying_only_to_one_tag(self):
        data = '<a accept-charset="ISO-8859-1 UTF-8"></a>'
        soup = self.soup(data)
        # We saw in another test that accept-charset is a cdata-list
        # attribute for the <form> tag. But it's not a cdata-list
        # attribute for any other tag.
        self.assertEqual('ISO-8859-1 UTF-8', soup.a['accept-charset'])
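# Aside (illustrative, not from the suite): because 'class' is a cdata-list
# attribute in HTML, bs4 parses it into a list and joins it back into a
# space-separated string on output. Assumes bs4 is installed.
#
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup('<a class="foo bar"></a>')
#     soup.a['class']             # ['foo', 'bar']
#     soup.a['class'].append('baz')
#     soup.a.encode()             # b'<a class="foo bar baz"></a>'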
class TestPersistence(SoupTest):
    "Testing features like pickle and deepcopy."

    def setUp(self):
        super(TestPersistence, self).setUp()
        self.page = """
        [removed]
        foo
        bar
        """
        self.tree = self.soup(self.page)
    def test_pickle_and_unpickle_identity(self):
        # Pickling a tree, then unpickling it, yields a tree identical
        # to the original.
        dumped = pickle.dumps(self.tree, 2)
        loaded = pickle.loads(dumped)
        self.assertEqual(loaded.__class__, BeautifulSoup)
        self.assertEqual(loaded.decode(), self.tree.decode())

    def test_deepcopy_identity(self):
        # Making a deepcopy of a tree yields an identical tree.
        copied = copy.deepcopy(self.tree)
        self.assertEqual(copied.decode(), self.tree.decode())

    def test_unicode_pickle(self):
        # A tree containing Unicode characters can be pickled.
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        dumped = pickle.dumps(soup, pickle.HIGHEST_PROTOCOL)
        loaded = pickle.loads(dumped)
        self.assertEqual(loaded.decode(), soup.decode())
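# Aside (illustrative sketch, assumes bs4 installed): the round trip these
# persistence tests rely on -- a parsed tree can be pickled or deep-copied,
# and the copy decodes to the same markup as the original.
#
#     import copy
#     import pickle
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup("<p>foo</p>")
#     assert pickle.loads(pickle.dumps(soup, 2)).decode() == soup.decode()
#     assert copy.deepcopy(soup).decode() == soup.decode()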
class TestSubstitutions(SoupTest):

    def test_default_formatter_is_minimal(self):
        markup = u"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
        soup = self.soup(markup)
        decoded = soup.decode(formatter="minimal")
        # The < is converted back into &lt; but the e-with-acute is left alone.
        self.assertEqual(
            decoded,
            self.document_for(
                u"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"))

    def test_formatter_html(self):
        markup = u"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
        soup = self.soup(markup)
        decoded = soup.decode(formatter="html")
        self.assertEqual(
            decoded,
            self.document_for("<b>&lt;&lt;Sacr&eacute; bleu!&gt;&gt;</b>"))

    def test_formatter_minimal(self):
        markup = u"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
        soup = self.soup(markup)
        decoded = soup.decode(formatter="minimal")
        # The < is converted back into &lt; but the e-with-acute is left alone.
        self.assertEqual(
            decoded,
            self.document_for(
                u"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"))

    def test_formatter_null(self):
        markup = u"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
        soup = self.soup(markup)
        decoded = soup.decode(formatter=None)
        # Neither the angle brackets nor the e-with-acute are converted.
        # This is not valid HTML, but it's what the user wanted.
        self.assertEqual(
            decoded,
            self.document_for(u"<b><<Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!>></b>"))

    def test_formatter_custom(self):
        markup = u"<b>&lt;foo&gt;</b><b>bar</b>"
        soup = self.soup(markup)
        decoded = soup.decode(formatter = lambda x: x.upper())
        # Instead of normal entity conversion code, the custom
        # callable is called on every string.
        self.assertEqual(
            decoded,
            self.document_for(u"<b><FOO></b><b>BAR</b>"))

    def test_formatter_is_run_on_attribute_values(self):
        markup = u'<a href="http://a.com?a=b&c=é">e</a>'
        soup = self.soup(markup)
        a = soup.a

        expect_minimal = u'<a href="http://a.com?a=b&amp;c=é">e</a>'

        self.assertEqual(expect_minimal, a.decode())
        self.assertEqual(expect_minimal, a.decode(formatter="minimal"))

        expect_html = u'<a href="http://a.com?a=b&amp;c=&eacute;">e</a>'
        self.assertEqual(expect_html, a.decode(formatter="html"))

        self.assertEqual(markup, a.decode(formatter=None))

        expect_upper = u'<a href="HTTP://A.COM?A=B&C=É">E</a>'
        self.assertEqual(expect_upper, a.decode(formatter=lambda x: x.upper()))
    def test_prettify_accepts_formatter(self):
        soup = BeautifulSoup("<html><body>foo</body></html>")
        pretty = soup.prettify(formatter = lambda x: x.upper())
        self.assertTrue("FOO" in pretty)

    def test_prettify_outputs_unicode_by_default(self):
        soup = self.soup("<a></a>")
        self.assertEqual(unicode, type(soup.prettify()))

    def test_prettify_can_encode_data(self):
        soup = self.soup("<a></a>")
        self.assertEqual(bytes, type(soup.prettify("utf-8")))

    def test_html_entity_substitution_off_by_default(self):
        markup = u"<b>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</b>"
        soup = self.soup(markup)
        encoded = soup.b.encode("utf-8")
        self.assertEqual(encoded, markup.encode('utf-8'))

    def test_encoding_substitution(self):
        # Here's the <meta> tag saying that a document is
        # encoded in Shift-JIS.
        meta_tag = ('<meta content="text/html; charset=x-sjis" '
                    'http-equiv="Content-type"/>')
        soup = self.soup(meta_tag)

        # Parse the document, and the charset appears unchanged.
        self.assertEqual(soup.meta['content'], 'text/html; charset=x-sjis')

        # Encode the document into some encoding, and the encoding is
        # substituted into the meta tag.
        utf_8 = soup.encode("utf-8")
        self.assertTrue(b"charset=utf-8" in utf_8)

        euc_jp = soup.encode("euc_jp")
        self.assertTrue(b"charset=euc_jp" in euc_jp)

        shift_jis = soup.encode("shift-jis")
        self.assertTrue(b"charset=shift-jis" in shift_jis)

        utf_16_u = soup.encode("utf-16").decode("utf-16")
        self.assertTrue("charset=utf-16" in utf_16_u)

    def test_encoding_substitution_doesnt_happen_if_tag_is_strained(self):
        markup = ('<head><meta content="text/html; charset=x-sjis" '
                  'http-equiv="Content-type"/></head><pre>foo</pre>')

        # Beautiful Soup used to try to rewrite the meta tag even if the
        # meta tag got filtered out by the strainer. This test makes
        # sure that doesn't happen.
        strainer = SoupStrainer('pre')
        soup = self.soup(markup, parse_only=strainer)
        self.assertEqual(soup.contents[0].name, 'pre')
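# Aside (illustrative summary, not from the suite) of the formatter options
# exercised above: "minimal" escapes only &, < and >; "html" also converts
# characters to named HTML entities; None performs no substitution; a
# callable is applied to every string. Assumes bs4 is installed.
#
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup(u"<b>caf\N{LATIN SMALL LETTER E WITH ACUTE}</b>")
#     soup.b.decode(formatter="minimal")            # u'<b>caf\xe9</b>'
#     soup.b.decode(formatter="html")               # u'<b>caf&eacute;</b>'
#     soup.b.decode(formatter=None)                 # u'<b>caf\xe9</b>'
#     soup.b.decode(formatter=lambda s: s.upper())  # u'<b>CAF\xc9</b>'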
class TestEncoding(SoupTest):
    """Test the ability to encode objects into strings."""

    def test_unicode_string_can_be_encoded(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        self.assertEqual(soup.b.string.encode("utf-8"),
                         u"\N{SNOWMAN}".encode("utf-8"))

    def test_tag_containing_unicode_string_can_be_encoded(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        self.assertEqual(
            soup.b.encode("utf-8"), html.encode("utf-8"))

    def test_encoding_substitutes_unrecognized_characters_by_default(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        self.assertEqual(soup.b.encode("ascii"), b"<b>&#9731;</b>")

    def test_encoding_can_be_made_strict(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        self.assertRaises(
            UnicodeEncodeError, soup.encode, "ascii", errors="strict")

    def test_decode_contents(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        self.assertEqual(u"\N{SNOWMAN}", soup.b.decode_contents())

    def test_encode_contents(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        self.assertEqual(
            u"\N{SNOWMAN}".encode("utf8"), soup.b.encode_contents(
                encoding="utf8"))

    def test_deprecated_renderContents(self):
        html = u"<b>\N{SNOWMAN}</b>"
        soup = self.soup(html)
        self.assertEqual(
            u"\N{SNOWMAN}".encode("utf8"), soup.b.renderContents())
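# Aside (illustrative sketch, assumes bs4 installed): encode() defaults to
# UTF-8 and falls back to XML character references for characters the
# target codec can't represent, as the tests above show.
#
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup(u"<b>\N{SNOWMAN}</b>")
#     soup.b.encode("utf-8")   # b'<b>\xe2\x98\x83</b>'
#     soup.b.encode("ascii")   # b'<b>&#9731;</b>'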
class TestNavigableStringSubclasses(SoupTest):

    def test_cdata(self):
        # None of the current builders turn CDATA sections into CData
        # objects, but you can create them manually.
        soup = self.soup("")
        cdata = CData("foo")
        soup.insert(1, cdata)
        self.assertEqual(str(soup), "<![CDATA[foo]]>")
        self.assertEqual(soup.find(text="foo"), "foo")
        self.assertEqual(soup.contents[0], "foo")

    def test_cdata_is_never_formatted(self):
        """Text inside a CData object is passed into the formatter.

        But the return value is ignored.
        """

        self.count = 0
        def increment(*args):
            self.count += 1
            return "BITTER FAILURE"

        soup = self.soup("")
        cdata = CData("<><><>")
        soup.insert(1, cdata)
        self.assertEqual(
            b"<![CDATA[<><><>]]>", soup.encode(formatter=increment))
        self.assertEqual(1, self.count)

    def test_doctype_ends_in_newline(self):
        # Unlike other NavigableString subclasses, a DOCTYPE always ends
        # in a newline.
        doctype = Doctype("foo")
        soup = self.soup("")
        soup.insert(1, doctype)
        self.assertEqual(soup.encode(), b"<!DOCTYPE foo>\n")
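# Aside (illustrative sketch): CData and Doctype are NavigableString
# subclasses with their own output formats, and can be built and inserted
# by hand. Assumes bs4 is installed.
#
#     from bs4 import BeautifulSoup, CData, Doctype
#     soup = BeautifulSoup("")
#     soup.append(Doctype("html"))   # renders as "<!DOCTYPE html>\n"
#     soup.append(CData("x < y"))    # renders as "<![CDATA[x < y]]>"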
class TestSoupSelector(TreeTest):

    HTML = """
<html>
<head>
<title>The title</title>
<link rel="stylesheet" href="blah.css" type="text/css" id="l1">
</head>
<body>
<div id="main">
<div id="inner">
<h1 id="header1">An H1</h1>
<p>Some text</p>
<p class="onep" id="p1">Some more text</p>
<h2 id="header2">An H2</h2>
<p class="class1 class2 class3" id="pmulti">Another</p>
<a href="http://bob.example.org/" rel="friend met" id="bob">Bob</a>
<h2 id="header3">Another H2</h2>
<a id="me" href="http://simonwillison.net/" rel="me">me</a>
<span class="s1">
<a href="#" id="s1a1">span1a1</a>
<a href="#" id="s1a2">span1a2 <span id="s1a2s1">test</span></a>
<span class="span2">
<a href="#" id="s2a1">span2a1</a>
</span>
<span class="span3"></span>
</span>
</div>
</div>
<div id="footer">
<p lang="en" id="lang-en">English</p>
<p lang="en-gb" id="lang-en-gb">English UK</p>
<p lang="en-us" id="lang-en-us">English US</p>
<p lang="fr" id="lang-fr">French</p>
</div>
</body>
</html>
"""

    def setUp(self):
        self.soup = BeautifulSoup(self.HTML)
    def assertSelects(self, selector, expected_ids):
        el_ids = [el['id'] for el in self.soup.select(selector)]
        el_ids.sort()
        expected_ids.sort()
        self.assertEqual(expected_ids, el_ids,
                         "Selector %s, expected [%s], got [%s]" % (
                selector, ', '.join(expected_ids), ', '.join(el_ids)))

    assertSelect = assertSelects

    def assertSelectMultiple(self, *tests):
        for selector, expected_ids in tests:
            self.assertSelect(selector, expected_ids)
def test_one_tag_one(self):
els = self.soup.select(‘title’)
self.assertEqual(len(els), 1)
self.assertEqual(els[0].name, ‘title’)
self.assertEqual(els[0].contents, [u’The title’])
def test_one_tag_many(self):
els = self.soup.select(‘div’)
self.assertEqual(len(els), 3)
for div in els:
self.assertEqual(div.name, ‘div’)
def test_tag_in_tag_one(self):
els = self.soup.select(‘div div’)
self.assertSelects(‘div div’, [‘inner’])
def test_tag_in_tag_many(self):
for selector in (‘html div’, ‘html body div’, ‘body div’):
self.assertSelects(selector, [‘main’, ‘inner’, ‘footer’])
def test_tag_no_match(self):
self.assertEqual(len(self.soup.select(‘del’)), 0)
def test_invalid_tag(self):
self.assertEqual(len(self.soup.select(‘tag%t’)), 0)
def test_header_tags(self):
self.assertSelectMultiple(
(‘h1’, [‘header1’]),
(‘h2’, [‘header2’, ‘header3’]),
)
def test_class_one(self):
for selector in (‘.onep’, ‘p.onep’, ‘html p.onep’):
els = self.soup.select(selector)
self.assertEqual(len(els), 1)
self.assertEqual(els[0].name, ‘p’)
self.assertEqual(els[0][‘class’], [‘onep’])
def test_class_mismatched_tag(self):
els = self.soup.select(‘div.onep’)
self.assertEqual(len(els), 0)
def test_one_id(self):
for selector in (‘div#inner’, ‘#inner’, ‘div div#inner’):
self.assertSelects(selector, [‘inner’])
def test_bad_id(self):
els = self.soup.select(‘#doesnotexist’)
self.assertEqual(len(els), 0)
def test_items_in_id(self):
els = self.soup.select(‘div#inner p’)
self.assertEqual(len(els), 3)
for el in els:
self.assertEqual(el.name, ‘p’)
self.assertEqual(els[1][‘class’], [‘onep’])
self.assertFalse(els[0].has_key(‘class’))
def test_a_bunch_of_emptys(self):
for selector in (‘div#main del’, ‘div#main div.oops’, ‘div div#main’):
self.assertEqual(len(self.soup.select(selector)), 0)
def test_multi_class_support(self):
for selector in (‘.class1’, ‘p.class1’, ‘.class2’, ‘p.class2’,
‘.class3’, ‘p.class3’, ‘html p.class2’, ‘div#inner .class2’):
self.assertSelects(selector, [‘pmulti’])
def test_multi_class_selection(self):
for selector in (‘.class1.class3’, ‘.class3.class2’,
‘.class1.class2.class3’):
self.assertSelects(selector, [‘pmulti’])
def test_child_selector(self):
self.assertSelects(‘.s1 > a’, [‘s1a1’, ‘s1a2’])
self.assertSelects(‘.s1 > a span’, [‘s1a2s1’])
def test_attribute_equals(self):
self.assertSelectMultiple(
(‘p[class=”onep”]’, [‘p1’]),
(‘p[id=”p1″]’, [‘p1’]),
(‘[class=”onep”]’, [‘p1’]),
(‘[id=”p1″]’, [‘p1’]),
(‘link[rel=”stylesheet”]’, [‘l1’]),
(‘link[type=”text/css”]’, [‘l1’]),
(‘link[href=”blah.css”]’, [‘l1’]),
(‘link[href=”no-blah.css”]’, []),
(‘[rel=”stylesheet”]’, [‘l1’]),
(‘[type=”text/css”]’, [‘l1’]),
(‘[href=”blah.css”]’, [‘l1’]),
(‘[href=”no-blah.css”]’, []),
(‘p[href=”no-blah.css”]’, []),
(‘[href=”no-blah.css”]’, []),
)
def test_attribute_tilde(self):
self.assertSelectMultiple(
(‘p[class~=”class1″]’, [‘pmulti’]),
(‘p[class~=”class2″]’, [‘pmulti’]),
(‘p[class~=”class3″]’, [‘pmulti’]),
(‘[class~=”class1″]’, [‘pmulti’]),
(‘[class~=”class2″]’, [‘pmulti’]),
(‘[class~=”class3″]’, [‘pmulti’]),
(‘a[rel~=”friend”]’, [‘bob’]),
(‘a[rel~=”met”]’, [‘bob’]),
(‘[rel~=”friend”]’, [‘bob’]),
(‘[rel~=”met”]’, [‘bob’]),
)
def test_attribute_startswith(self):
self.assertSelectMultiple(
(‘[rel^=”style”]’, [‘l1’]),
(‘link[rel^=”style”]’, [‘l1’]),
(‘notlink[rel^=”notstyle”]’, []),
(‘[rel^=”notstyle”]’, []),
(‘link[rel^=”notstyle”]’, []),
(‘link[href^=”bla”]’, [‘l1’]),
(‘a[href^=”http://”]’, [‘bob’, ‘me’]),
(‘[href^=”http://”]’, [‘bob’, ‘me’]),
(‘[id^=”p”]’, [‘pmulti’, ‘p1’]),
(‘[id^=”m”]’, [‘me’, ‘main’]),
(‘div[id^=”m”]’, [‘main’]),
(‘a[id^=”m”]’, [‘me’]),
)
def test_attribute_endswith(self):
self.assertSelectMultiple(
(‘[href$=”.css”]’, [‘l1’]),
(‘link[href$=”.css”]’, [‘l1’]),
(‘link[id$=”1″]’, [‘l1’]),
(‘[id$=”1″]’, [‘l1’, ‘p1’, ‘header1’, ‘s1a1’, ‘s2a1’, ‘s1a2s1’]),
(‘div[id$=”1″]’, []),
(‘[id$=”noending”]’, []),
)
def test_attribute_contains(self):
self.assertSelectMultiple(
# From test_attribute_startswith
(‘[rel*=”style”]’, [‘l1’]),
(‘link[rel*=”style”]’, [‘l1’]),
(‘notlink[rel*=”notstyle”]’, []),
(‘[rel*=”notstyle”]’, []),
(‘link[rel*=”notstyle”]’, []),
(‘link[href*=”bla”]’, [‘l1’]),
(‘a[href*=”http://”]’, [‘bob’, ‘me’]),
(‘[href*=”http://”]’, [‘bob’, ‘me’]),
(‘[id*=”p”]’, [‘pmulti’, ‘p1’]),
(‘div[id*=”m”]’, [‘main’]),
(‘a[id*=”m”]’, [‘me’]),
# From test_attribute_endswith
(‘[href*=”.css”]’, [‘l1’]),
(‘link[href*=”.css”]’, [‘l1’]),
(‘link[id*=”1″]’, [‘l1’]),
(‘[id*=”1″]’, [‘l1’, ‘p1’, ‘header1’, ‘s1a1’, ‘s1a2’, ‘s2a1’, ‘s1a2s1’]),
(‘div[id*=”1″]’, []),
(‘[id*=”noending”]’, []),
# New for this test
(‘[href*=”.”]’, [‘bob’, ‘me’, ‘l1’]),
(‘a[href*=”.”]’, [‘bob’, ‘me’]),
(‘link[href*=”.”]’, [‘l1’]),
(‘div[id*=”n”]’, [‘main’, ‘inner’]),
(‘div[id*=”nn”]’, [‘inner’]),
)
def test_attribute_exact_or_hypen(self):
self.assertSelectMultiple(
(‘p[lang|=”en”]’, [‘lang-en’, ‘lang-en-gb’, ‘lang-en-us’]),
(‘[lang|=”en”]’, [‘lang-en’, ‘lang-en-gb’, ‘lang-en-us’]),
(‘p[lang|=”fr”]’, [‘lang-fr’]),
(‘p[lang|=”gb”]’, []),
)
def test_attribute_exists(self):
self.assertSelectMultiple(
(‘[rel]’, [‘l1’, ‘bob’, ‘me’]),
(‘link[rel]’, [‘l1’]),
(‘a[rel]’, [‘bob’, ‘me’]),
(‘[lang]’, [‘lang-en’, ‘lang-en-gb’, ‘lang-en-us’, ‘lang-fr’]),
(‘p[class]’, [‘p1’, ‘pmulti’]),
(‘[blah]’, []),
(‘p[blah]’, []),
)
    def test_select_on_element(self):
        # Other tests operate on the tree; this operates on an element
        # within the tree.
        inner = self.soup.find("div", id="main")
        selected = inner.select("div")
        # The <div id="inner"> tag was selected. The <div id="footer">
        # tag was not.
        self.assertSelectsIDs(selected, ['inner'])
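# Aside (illustrative recap, not part of the suite): soup.select() accepts
# a limited CSS selector syntax -- tag, #id, .class, [attr], [attr=value],
# [attr~=...], [attr^=...], [attr$=...], [attr*=...], [attr|=...], plus
# descendant and child combinators -- and works on any Tag, not just the
# whole document. Assumes bs4 is installed.
#
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup('<div id="main"><p class="onep">hi</p></div>')
#     [t.name for t in soup.select("div p.onep")]    # ['p']
#     soup.select('p[class~="onep"]')[0].get_text()  # u'hi'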
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/tests/__init__.py
"The beautifulsoup tests."
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/bs4/__init__.py
"""Beautiful Soup
Elixir and Tonic
"The Screen-Scraper's Friend"
http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup uses a pluggable XML or HTML parser to parse a
(possibly invalid) document into a tree representation. Beautiful Soup
provides methods and Pythonic idioms that make it easy to navigate,
search, and modify the parse tree.

Beautiful Soup works with Python 2.6 and up. It works better if lxml
and/or html5lib is installed.

For more than you ever wanted to know about Beautiful Soup, see the
documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
"""

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "4.1.0"
__copyright__ = "Copyright (c) 2004-2012 Leonard Richardson"
__license__ = "MIT"
__all__ = ['BeautifulSoup']
import re
import warnings
from .builder import builder_registry
from .dammit import UnicodeDammit
from .element import (
CData,
Comment,
DEFAULT_OUTPUT_ENCODING,
Declaration,
Doctype,
NavigableString,
PageElement,
ProcessingInstruction,
ResultSet,
SoupStrainer,
Tag,
)
# The very first thing we do is give a useful error if someone is
# running this code under Python 3 without converting it.
syntax_error = u’You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work. You need to convert the code, either by installing it (`python setup.py install`) or by running 2to3 (`2to3 -w bs4`).’
class BeautifulSoup(Tag):
“””
This class defines the basic interface called by the tree builders.
These methods will be called by the parser:
reset()
feed(markup)
The tree builder may call these methods from its feed() implementation:
handle_starttag(name, attrs) # See note about return value
handle_endtag(name)
handle_data(data) # Appends to the current data node
endData(containerClass=NavigableString) # Ends the current data node
No matter how complicated the underlying parser is, you should be
able to build a tree using ‘start tag’ events, ‘end tag’ events,
‘data’ events, and “done with data” events.
    If you encounter an empty-element tag (aka a self-closing tag,
    like HTML's <br> tag), call handle_starttag and then
    handle_endtag.
“””
ROOT_TAG_NAME = u'[document]’
# If the end-user gives no indication which tree builder they
# want, look for one with these features.
DEFAULT_BUILDER_FEATURES = [‘html’, ‘fast’]
# Used when determining whether a text node is all whitespace and
# can be replaced with a single space. A text node that contains
# fancy Unicode spaces (usually non-breaking) should be left
# alone.
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None, }
def __init__(self, markup=””, features=None, builder=None,
parse_only=None, from_encoding=None, **kwargs):
“””The Soup object is initialized as the ‘root tag’, and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser.”””
if ‘convertEntities’ in kwargs:
warnings.warn(
“BS4 does not respect the convertEntities argument to the ”
“BeautifulSoup constructor. Entities are always converted ”
“to Unicode characters.”)
if ‘markupMassage’ in kwargs:
del kwargs[‘markupMassage’]
warnings.warn(
“BS4 does not respect the markupMassage argument to the ”
“BeautifulSoup constructor. The tree builder is responsible ”
“for any necessary markup massage.”)
if ‘smartQuotesTo’ in kwargs:
del kwargs[‘smartQuotesTo’]
warnings.warn(
“BS4 does not respect the smartQuotesTo argument to the ”
“BeautifulSoup constructor. Smart quotes are always converted ”
“to Unicode characters.”)
if ‘selfClosingTags’ in kwargs:
del kwargs[‘selfClosingTags’]
warnings.warn(
“BS4 does not respect the selfClosingTags argument to the ”
“BeautifulSoup constructor. The tree builder is responsible ”
“for understanding self-closing tags.”)
if ‘isHTML’ in kwargs:
del kwargs[‘isHTML’]
warnings.warn(
“BS4 does not respect the isHTML argument to the ”
“BeautifulSoup constructor. You can pass in features=’html’ ”
“or features=’xml’ to get a builder capable of handling ”
“one or the other.”)
def deprecated_argument(old_name, new_name):
if old_name in kwargs:
warnings.warn(
‘The “%s” argument to the BeautifulSoup constructor ‘
‘has been renamed to “%s.”‘ % (old_name, new_name))
value = kwargs[old_name]
del kwargs[old_name]
return value
return None
parse_only = parse_only or deprecated_argument(
“parseOnlyThese”, “parse_only”)
from_encoding = from_encoding or deprecated_argument(
“fromEncoding”, “from_encoding”)
if len(kwargs) > 0:
arg = kwargs.keys().pop()
            raise TypeError(
                "__init__() got an unexpected keyword argument '%s'" % arg)
if builder is None:
if isinstance(features, basestring):
features = [features]
if features is None or len(features) == 0:
features = self.DEFAULT_BUILDER_FEATURES
builder_class = builder_registry.lookup(*features)
if builder_class is None:
raise ValueError(
“Couldn’t find a tree builder with the features you ”
“requested: %s. Do you need to install a parser library?”
% “,”.join(features))
builder = builder_class()
self.builder = builder
self.is_xml = builder.is_xml
self.builder.soup = self
self.parse_only = parse_only
self.reset()
if hasattr(markup, ‘read’): # It’s a file-type object.
markup = markup.read()
(self.markup, self.original_encoding, self.declared_html_encoding,
self.contains_replacement_characters) = (
self.builder.prepare_markup(markup, from_encoding))
try:
self._feed()
except StopParsing:
pass
# Clear out the markup and remove the builder’s circular
# reference to this object.
self.markup = None
self.builder.soup = None
def _feed(self):
# Convert the document to Unicode.
self.builder.reset()
self.builder.feed(self.markup)
# Close out any unfinished strings and close all the open tags.
self.endData()
while self.currentTag.name != self.ROOT_TAG_NAME:
self.popTag()
def reset(self):
Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
self.hidden = 1
self.builder.reset()
self.currentData = []
self.currentTag = None
self.tagStack = []
self.pushTag(self)
def new_tag(self, name, namespace=None, nsprefix=None, **attrs):
“””Create a new tag associated with this soup.”””
return Tag(None, self.builder, name, namespace, nsprefix, attrs)
def new_string(self, s):
“””Create a new NavigableString associated with this soup.”””
navigable = NavigableString(s)
navigable.setup()
return navigable
def insert_before(self, successor):
raise ValueError(“BeautifulSoup objects don’t support insert_before().”)
def insert_after(self, successor):
raise ValueError(“BeautifulSoup objects don’t support insert_after().”)
def popTag(self):
tag = self.tagStack.pop()
#print “Pop”, tag.name
if self.tagStack:
self.currentTag = self.tagStack[-1]
return self.currentTag
def pushTag(self, tag):
#print “Push”, tag.name
if self.currentTag:
self.currentTag.contents.append(tag)
self.tagStack.append(tag)
self.currentTag = self.tagStack[-1]
def endData(self, containerClass=NavigableString):
if self.currentData:
            currentData = u''.join(self.currentData)
if (currentData.translate(self.STRIP_ASCII_SPACES) == ” and
not set([tag.name for tag in self.tagStack]).intersection(
self.builder.preserve_whitespace_tags)):
if ‘\n’ in currentData:
currentData = ‘\n’
else:
currentData = ‘ ‘
self.currentData = []
if self.parse_only and len(self.tagStack) <= 1 and \
(not self.parse_only.text or \
not self.parse_only.search(currentData)):
return
o = containerClass(currentData)
self.object_was_parsed(o)
def object_was_parsed(self, o):
"""Add an object to the parse tree."""
o.setup(self.currentTag, self.previous_element)
if self.previous_element:
self.previous_element.next_element = o
self.previous_element = o
self.currentTag.contents.append(o)
    def _popToTag(self, name, nsprefix=None, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
#print "Popping to %s" % name
if name == self.ROOT_TAG_NAME:
return
numPops = 0
mostRecentTag = None
for i in range(len(self.tagStack) - 1, 0, -1):
            if (name == self.tagStack[i].name
                and nsprefix == self.tagStack[i].nsprefix):
numPops = len(self.tagStack) - i
break
if not inclusivePop:
numPops = numPops - 1
for i in range(0, numPops):
mostRecentTag = self.popTag()
return mostRecentTag
def handle_starttag(self, name, namespace, nsprefix, attrs):
"""Push a start tag on to the stack.
        If this method returns None, the tag was rejected by the
        SoupStrainer. You should proceed as if the tag had not occurred
        in the document. For instance, if this was a self-closing tag,
        don't call handle_endtag.
"""
# print "Start tag %s: %s" % (name, attrs)
self.endData()
if (self.parse_only and len(self.tagStack) <= 1
and (self.parse_only.text
or not self.parse_only.search_tag(name, attrs))):
return None
tag = Tag(self, self.builder, name, namespace, nsprefix, attrs,
self.currentTag, self.previous_element)
if tag is None:
return tag
if self.previous_element:
self.previous_element.next_element = tag
self.previous_element = tag
self.pushTag(tag)
return tag
def handle_endtag(self, name, nsprefix=None):
#print "End tag: " + name
self.endData()
self._popToTag(name, nsprefix)
def handle_data(self, data):
self.currentData.append(data)
def decode(self, pretty_print=False,
eventual_encoding=DEFAULT_OUTPUT_ENCODING,
formatter="minimal"):
"""Returns a string or Unicode representation of this document.
To get Unicode, pass None for encoding."""
if self.is_xml:
# Print the XML declaration
encoding_part = ''
if eventual_encoding != None:
encoding_part = ' encoding="%s"' % eventual_encoding
            prefix = u'<?xml version="1.0"%s?>\n' % encoding_part
        else:
            prefix = u''
if not pretty_print:
indent_level = None
else:
indent_level = 0
return prefix + super(BeautifulSoup, self).decode(
indent_level, eventual_encoding, formatter)
class BeautifulStoneSoup(BeautifulSoup):
“””Deprecated interface to an XML parser.”””
def __init__(self, *args, **kwargs):
kwargs[‘features’] = ‘xml’
warnings.warn(
‘The BeautifulStoneSoup class is deprecated. Instead of using ‘
‘it, pass features=”xml” into the BeautifulSoup constructor.’)
super(BeautifulStoneSoup, self).__init__(*args, **kwargs)
class StopParsing(Exception):
pass
#By default, act as an HTML pretty-printer.
if __name__ == ‘__main__’:
import sys
soup = BeautifulSoup(sys.stdin)
print soup.prettify()
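# Aside (illustrative usage sketch for the constructor above, assumes bs4
# installed): 'features' picks a tree builder from builder_registry; with
# no argument the best available HTML builder is used, and if no builder
# matches, __init__ raises the ValueError defined above.
#
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup("<p>foo</p>")              # default HTML builder
#     soup = BeautifulSoup("<p>foo</p>", "html.parser")
#     soup = BeautifulSoup("<doc>foo</doc>", features="xml")  # needs lxml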
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/builder/_html5lib.py
__all__ = [
    'HTML5TreeBuilder',
    ]
import warnings
from bs4.builder import (
PERMISSIVE,
HTML,
HTML_5,
HTMLTreeBuilder,
)
from bs4.element import NamespacedAttribute
import html5lib
from html5lib.constants import namespaces
from bs4.element import (
Comment,
Doctype,
NavigableString,
Tag,
)
class HTML5TreeBuilder(HTMLTreeBuilder):
“””Use html5lib to build a tree.”””
features = [‘html5lib’, PERMISSIVE, HTML_5, HTML]
def prepare_markup(self, markup, user_specified_encoding):
# Store the user-specified encoding for use later on.
self.user_specified_encoding = user_specified_encoding
return markup, None, None, False
# These methods are defined by Beautiful Soup.
def feed(self, markup):
if self.soup.parse_only is not None:
warnings.warn(“You provided a value for parse_only, but the html5lib tree builder doesn’t support parse_only. The entire document will be parsed.”)
parser = html5lib.HTMLParser(tree=self.create_treebuilder)
doc = parser.parse(markup, encoding=self.user_specified_encoding)
# Set the character encoding detected by the tokenizer.
if isinstance(markup, str):
# We need to special-case this because html5lib sets
# charEncoding to UTF-8 if it gets Unicode input.
doc.original_encoding = None
else:
doc.original_encoding = parser.tokenizer.stream.charEncoding[0]
def create_treebuilder(self, namespaceHTMLElements):
self.underlying_builder = TreeBuilderForHtml5lib(
self.soup, namespaceHTMLElements)
return self.underlying_builder
    def test_fragment_to_document(self, fragment):
        """See `TreeBuilder`."""
        return '<html><head></head><body>%s</body></html>' % fragment
class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
def __init__(self, soup, namespaceHTMLElements):
self.soup = soup
super(TreeBuilderForHtml5lib, self).__init__(namespaceHTMLElements)
def documentClass(self):
self.soup.reset()
return Element(self.soup, self.soup, None)
def insertDoctype(self, token):
name = token[“name”]
publicId = token[“publicId”]
systemId = token[“systemId”]
doctype = Doctype.for_name_and_ids(name, publicId, systemId)
self.soup.object_was_parsed(doctype)
def elementClass(self, name, namespace):
tag = self.soup.new_tag(name, namespace)
return Element(tag, self.soup, namespace)
def commentClass(self, data):
return TextNode(Comment(data), self.soup)
def fragmentClass(self):
self.soup = BeautifulSoup(“”)
self.soup.name = “[document_fragment]”
return Element(self.soup, self.soup, None)
def appendChild(self, node):
# XXX This code is not covered by the BS4 tests.
self.soup.append(node.element)
def getDocument(self):
return self.soup
def getFragment(self):
return html5lib.treebuilders._base.TreeBuilder.getFragment(self).element
class AttrList(object):
def __init__(self, element):
self.element = element
self.attrs = dict(self.element.attrs)
def __iter__(self):
return list(self.attrs.items()).__iter__()
def __setitem__(self, name, value):
“set attr”, name, value
self.element[name] = value
def items(self):
return list(self.attrs.items())
def keys(self):
return list(self.attrs.keys())
def __len__(self):
return len(self.attrs)
def __getitem__(self, name):
return self.attrs[name]
def __contains__(self, name):
return name in list(self.attrs.keys())
class Element(html5lib.treebuilders._base.Node):
def __init__(self, element, soup, namespace):
html5lib.treebuilders._base.Node.__init__(self, element.name)
self.element = element
self.soup = soup
self.namespace = namespace
def appendChild(self, node):
if (node.element.__class__ == NavigableString and self.element.contents
and self.element.contents[-1].__class__ == NavigableString):
# Concatenate new text onto old text node
# XXX This has O(n^2) performance, for input like
# “aaa…”
old_element = self.element.contents[-1]
new_element = self.soup.new_string(old_element + node.element)
old_element.replace_with(new_element)
else:
self.element.append(node.element)
node.parent = self
def getAttributes(self):
return AttrList(self.element)
def setAttributes(self, attributes):
if attributes is not None and len(attributes) > 0:
converted_attributes = []
for name, value in list(attributes.items()):
if isinstance(name, tuple):
new_name = NamespacedAttribute(*name)
del attributes[name]
attributes[new_name] = value
self.soup.builder._replace_cdata_list_attribute_values(
self.name, attributes)
for name, value in list(attributes.items()):
self.element[name] = value
# The attributes may contain variables that need substitution.
# Call set_up_substitutions manually.
#
# The Tag constructor called this method when the Tag was created,
# but we just set/changed the attributes, so call it again.
self.soup.builder.set_up_substitutions(self.element)
attributes = property(getAttributes, setAttributes)
def insertText(self, data, insertBefore=None):
text = TextNode(self.soup.new_string(data), self.soup)
if insertBefore:
self.insertBefore(text, insertBefore)
else:
self.appendChild(text)
def insertBefore(self, node, refNode):
index = self.element.index(refNode.element)
if (node.element.__class__ == NavigableString and self.element.contents
and self.element.contents[index-1].__class__ == NavigableString):
# (See comments in appendChild)
old_node = self.element.contents[index-1]
new_str = self.soup.new_string(old_node + node.element)
old_node.replace_with(new_str)
else:
self.element.insert(index, node.element)
node.parent = self
def removeChild(self, node):
node.element.extract()
def reparentChildren(self, newParent):
while self.element.contents:
child = self.element.contents[0]
child.extract()
if isinstance(child, Tag):
newParent.appendChild(
Element(child, self.soup, namespaces[“html”]))
else:
newParent.appendChild(
TextNode(child, self.soup))
def cloneNode(self):
tag = self.soup.new_tag(self.element.name, self.namespace)
node = Element(tag, self.soup, self.namespace)
for key,value in self.attributes:
node.attributes[key] = value
return node
def hasContent(self):
return self.element.contents
def getNameTuple(self):
if self.namespace == None:
return namespaces[“html”], self.name
else:
return self.namespace, self.name
nameTuple = property(getNameTuple)
class TextNode(Element):
def __init__(self, element, soup):
html5lib.treebuilders._base.Node.__init__(self, None)
self.element = element
self.soup = soup
def cloneNode(self):
raise NotImplementedError
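# Aside (illustrative sketch): selecting this builder explicitly. Assumes
# html5lib is installed; otherwise the builder lookup raises a ValueError.
#
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup("<p>unclosed", "html5lib")
#     # html5lib repairs the document the way a browser would, wrapping
#     # the fragment in <html>, <head> and <body>.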
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/builder/_htmlparser.py
“””Use the HTMLParser library to parse HTML files that aren’t too bad.”””
__all__ = [
‘HTMLParserTreeBuilder’,
]
from html.parser import (
HTMLParser,
HTMLParseError,
)
import sys
import warnings
# Starting in Python 3.2, the HTMLParser constructor takes a ‘strict’
# argument, which we’d like to set to False. Unfortunately,
# http://bugs.python.org/issue13273 makes strict=True a better bet
# before Python 3.2.3.
#
# At the end of this file, we monkeypatch HTMLParser so that
# strict=True works well on Python 3.2.2.
major, minor, release = sys.version_info[:3]
CONSTRUCTOR_TAKES_STRICT = (
major > 3
or (major == 3 and minor > 2)
or (major == 3 and minor == 2 and release >= 3))
from bs4.element import (
CData,
Comment,
Declaration,
Doctype,
ProcessingInstruction,
)
from bs4.dammit import EntitySubstitution, UnicodeDammit
from bs4.builder import (
HTML,
HTMLTreeBuilder,
STRICT,
)
HTMLPARSER = ‘html.parser’
class BeautifulSoupHTMLParser(HTMLParser):
def handle_starttag(self, name, attrs):
# XXX namespace
self.soup.handle_starttag(name, None, None, dict(attrs))
def handle_endtag(self, name):
self.soup.handle_endtag(name)
def handle_data(self, data):
self.soup.handle_data(data)
def handle_charref(self, name):
# XXX workaround for a bug in HTMLParser. Remove this once
# it’s fixed.
if name.startswith(‘x’):
real_name = int(name.lstrip(‘x’), 16)
else:
real_name = int(name)
try:
data = chr(real_name)
except (ValueError, OverflowError) as e:
data = “\N{REPLACEMENT CHARACTER}”
self.handle_data(data)
def handle_entityref(self, name):
character = EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
if character is not None:
data = character
else:
data = “&%s;” % name
self.handle_data(data)
def handle_comment(self, data):
self.soup.endData()
self.soup.handle_data(data)
self.soup.endData(Comment)
def handle_decl(self, data):
self.soup.endData()
if data.startswith(“DOCTYPE “):
data = data[len(“DOCTYPE “):]
self.soup.handle_data(data)
self.soup.endData(Doctype)
def unknown_decl(self, data):
if data.upper().startswith(‘CDATA[‘):
cls = CData
data = data[len(‘CDATA[‘):]
else:
cls = Declaration
self.soup.endData()
self.soup.handle_data(data)
self.soup.endData(cls)
def handle_pi(self, data):
self.soup.endData()
if data.endswith(“?”) and data.lower().startswith(“xml”):
# “An XHTML processing instruction using the trailing ‘?’
# will cause the ‘?’ to be included in data.” – HTMLParser
# docs.
#
# Strip the question mark so we don’t end up with two
# question marks.
data = data[:-1]
self.soup.handle_data(data)
self.soup.endData(ProcessingInstruction)
class HTMLParserTreeBuilder(HTMLTreeBuilder):
is_xml = False
features = [HTML, STRICT, HTMLPARSER]
def __init__(self, *args, **kwargs):
if CONSTRUCTOR_TAKES_STRICT:
kwargs[‘strict’] = False
self.parser_args = (args, kwargs)
def prepare_markup(self, markup, user_specified_encoding=None,
document_declared_encoding=None):
“””
:return: A 4-tuple (markup, original encoding, encoding
declared within markup, whether any characters had to be
replaced with REPLACEMENT CHARACTER).
“””
if isinstance(markup, str):
return markup, None, None, False
try_encodings = [user_specified_encoding, document_declared_encoding]
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
return (dammit.markup, dammit.original_encoding,
dammit.declared_html_encoding,
dammit.contains_replacement_characters)
def feed(self, markup):
args, kwargs = self.parser_args
parser = BeautifulSoupHTMLParser(*args, **kwargs)
parser.soup = self.soup
try:
parser.feed(markup)
except HTMLParseError as e:
warnings.warn(RuntimeWarning(
“Python’s built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.”))
raise e
# Patch 3.2 versions of HTMLParser earlier than 3.2.3 to use some
# 3.2.3 code. This ensures they don't treat markup like
# <a href="http://foo.com/"> as a string.
#
# XXX This code can be removed once most Python 3 users are on 3.2.3.
if major == 3 and minor == 2 and not CONSTRUCTOR_TAKES_STRICT:
import re
attrfind_tolerant = re.compile(
r’\s*((?<=[\'"\s])[^\s/>][^\s/=>]*)(\s*=+\s*’
r'(\'[^\’]*\’|”[^”]*”|(?![\'”])[^>\s]*))?’)
HTMLParserTreeBuilder.attrfind_tolerant = attrfind_tolerant
locatestarttagend = re.compile(r”””
<[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
(?:\s+ # whitespace before attribute name
(?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
(?:\s*=\s* # value indicator
(?:'[^']*' # LITA-enclosed value
|\"[^\"]*\" # LIT-enclosed value
|[^'\">\s]+ # bare value
)
)?
)
)*
\s* # trailing whitespace
“””, re.VERBOSE)
BeautifulSoupHTMLParser.locatestarttagend = locatestarttagend
from html.parser import tagfind, attrfind
def parse_starttag(self, i):
self.__starttag_text = None
endpos = self.check_for_whole_start_tag(i)
if endpos < 0:
return endpos
rawdata = self.rawdata
self.__starttag_text = rawdata[i:endpos]
# Now parse the data between i+1 and j into a tag and attrs
attrs = []
match = tagfind.match(rawdata, i+1)
assert match, 'unexpected call to parse_starttag()'
k = match.end()
self.lasttag = tag = rawdata[i+1:k].lower()
while k < endpos:
if self.strict:
m = attrfind.match(rawdata, k)
else:
m = attrfind_tolerant.match(rawdata, k)
if not m:
break
attrname, rest, attrvalue = m.group(1, 2, 3)
if not rest:
attrvalue = None
elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
attrvalue[:1] == '"' == attrvalue[-1:]:
attrvalue = attrvalue[1:-1]
if attrvalue:
attrvalue = self.unescape(attrvalue)
attrs.append((attrname.lower(), attrvalue))
k = m.end()
end = rawdata[k:endpos].strip()
if end not in (">“, “/>”):
lineno, offset = self.getpos()
if “\n” in self.__starttag_text:
lineno = lineno + self.__starttag_text.count(“\n”)
offset = len(self.__starttag_text) \
– self.__starttag_text.rfind(“\n”)
else:
offset = offset + len(self.__starttag_text)
if self.strict:
self.error(“junk characters in start tag: %r”
% (rawdata[k:endpos][:20],))
self.handle_data(rawdata[i:endpos])
return endpos
if end.endswith(‘/>’):
# XHTML-style empty tag:
self.handle_startendtag(tag, attrs)
else:
self.handle_starttag(tag, attrs)
if tag in self.CDATA_CONTENT_ELEMENTS:
self.set_cdata_mode(tag)
return endpos
    def set_cdata_mode(self, elem):
        self.cdata_elem = elem.lower()
        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)

    BeautifulSoupHTMLParser.parse_starttag = parse_starttag
    BeautifulSoupHTMLParser.set_cdata_mode = set_cdata_mode
CONSTRUCTOR_TAKES_STRICT = True
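# Aside (illustrative sketch): this builder is registered under the
# 'html.parser' feature and needs no third-party parser library.
#
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup("<a href='x.html'>link</a>", "html.parser")
#     soup.a['href']   # 'x.html'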
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/builder/_lxml.py
__all__ = [
    'LXMLTreeBuilderForXML',
    'LXMLTreeBuilder',
    ]
from io import StringIO
import collections
from lxml import etree
from bs4.element import Comment, Doctype, NamespacedAttribute
from bs4.builder import (
FAST,
HTML,
HTMLTreeBuilder,
PERMISSIVE,
TreeBuilder,
XML)
from bs4.dammit import UnicodeDammit
LXML = ‘lxml’
class LXMLTreeBuilderForXML(TreeBuilder):
DEFAULT_PARSER_CLASS = etree.XMLParser
is_xml = True
# Well, it’s permissive by XML parser standards.
features = [LXML, XML, FAST, PERMISSIVE]
CHUNK_SIZE = 512
@property
def default_parser(self):
# This can either return a parser object or a class, which
# will be instantiated with default arguments.
return etree.XMLParser(target=self, strip_cdata=False, recover=True)
def __init__(self, parser=None, empty_element_tags=None):
if empty_element_tags is not None:
self.empty_element_tags = set(empty_element_tags)
if parser is None:
# Use the default parser.
parser = self.default_parser
if isinstance(parser, collections.Callable):
# Instantiate the parser with default arguments
parser = parser(target=self, strip_cdata=False)
self.parser = parser
self.soup = None
self.nsmaps = None
def _getNsTag(self, tag):
# Split the namespace URL out of a fully-qualified lxml tag
# name. Copied from lxml’s src/lxml/sax.py.
if tag[0] == ‘{‘:
return tuple(tag[1:].split(‘}’, 1))
else:
return (None, tag)
def prepare_markup(self, markup, user_specified_encoding=None,
document_declared_encoding=None):
        """
        :return: A 4-tuple (markup, original encoding, encoding
         declared within markup, whether any characters had to be
         replaced with REPLACEMENT CHARACTER).
        """
        if isinstance(markup, str):
            return markup, None, None, False
try_encodings = [user_specified_encoding, document_declared_encoding]
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
return (dammit.markup, dammit.original_encoding,
dammit.declared_html_encoding,
dammit.contains_replacement_characters)
def feed(self, markup):
if isinstance(markup, str):
markup = StringIO(markup)
# Call feed() at least once, even if the markup is empty,
# or the parser won’t be initialized.
        data = markup.read(self.CHUNK_SIZE)
        self.parser.feed(data)
        while data != '':
            # Now call feed() on the rest of the data, chunk by chunk.
            data = markup.read(self.CHUNK_SIZE)
            if data != '':
                self.parser.feed(data)
        self.parser.close()
def close(self):
self.nsmaps = None
def start(self, name, attrs, nsmap={}):
# Make sure attrs is a mutable dict–lxml may send an immutable dictproxy.
attrs = dict(attrs)
nsprefix = None
# Invert each namespace map as it comes in.
if len(nsmap) == 0 and self.nsmaps != None:
# There are no new namespaces for this tag, but namespaces
# are in play, so we need a separate tag stack to know
# when they end.
self.nsmaps.append(None)
elif len(nsmap) > 0:
# A new namespace mapping has come into play.
if self.nsmaps is None:
self.nsmaps = []
inverted_nsmap = dict((value, key) for key, value in list(nsmap.items()))
self.nsmaps.append(inverted_nsmap)
# Also treat the namespace mapping as a set of attributes on the
# tag, so we can recreate it later.
attrs = attrs.copy()
for prefix, namespace in list(nsmap.items()):
attribute = NamespacedAttribute(
“xmlns”, prefix, “http://www.w3.org/2000/xmlns/”)
attrs[attribute] = namespace
namespace, name = self._getNsTag(name)
if namespace is not None:
for inverted_nsmap in reversed(self.nsmaps):
if inverted_nsmap is not None and namespace in inverted_nsmap:
nsprefix = inverted_nsmap[namespace]
break
self.soup.handle_starttag(name, namespace, nsprefix, attrs)
def end(self, name):
self.soup.endData()
completed_tag = self.soup.tagStack[-1]
namespace, name = self._getNsTag(name)
nsprefix = None
if namespace is not None:
for inverted_nsmap in reversed(self.nsmaps):
if inverted_nsmap is not None and namespace in inverted_nsmap:
nsprefix = inverted_nsmap[namespace]
break
self.soup.handle_endtag(name, nsprefix)
if self.nsmaps != None:
# This tag, or one of its parents, introduced a namespace
# mapping, so pop it off the stack.
self.nsmaps.pop()
if len(self.nsmaps) == 0:
# Namespaces are no longer in play, so don’t bother keeping
# track of the namespace stack.
self.nsmaps = None
def pi(self, target, data):
pass
def data(self, content):
self.soup.handle_data(content)
def doctype(self, name, pubid, system):
self.soup.endData()
doctype = Doctype.for_name_and_ids(name, pubid, system)
self.soup.object_was_parsed(doctype)
def comment(self, content):
“Handle comments as Comment objects.”
self.soup.endData()
self.soup.handle_data(content)
self.soup.endData(Comment)
    def test_fragment_to_document(self, fragment):
        """See `TreeBuilder`."""
        return '<?xml version="1.0" encoding="utf-8"?>\n%s' % fragment
class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
features = [LXML, HTML, FAST, PERMISSIVE]
is_xml = False
@property
def default_parser(self):
return etree.HTMLParser
def feed(self, markup):
self.parser.feed(markup)
self.parser.close()
    def test_fragment_to_document(self, fragment):
        """See `TreeBuilder`."""
        return '<html><body>%s</body></html>' % fragment
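# Aside (illustrative sketch): the two lxml builders are selected with the
# 'lxml' and 'xml' features; both assume lxml is installed.
#
#     from bs4 import BeautifulSoup
#     BeautifulSoup("<p>foo", "lxml")                 # lenient HTML parsing
#     BeautifulSoup("<doc><p>foo</p></doc>", "xml")   # real XML parsing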
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/builder/__init__.py
from collections import defaultdict
import itertools
import sys
from bs4.element import (
CharsetMetaAttributeValue,
ContentMetaAttributeValue,
whitespace_re
)
__all__ = [
‘HTMLTreeBuilder’,
‘SAXTreeBuilder’,
‘TreeBuilder’,
‘TreeBuilderRegistry’,
]
# Some useful features for a TreeBuilder to have.
FAST = ‘fast’
PERMISSIVE = ‘permissive’
STRICT = ‘strict’
XML = ‘xml’
HTML = ‘html’
HTML_5 = ‘html5’
class TreeBuilderRegistry(object):
def __init__(self):
self.builders_for_feature = defaultdict(list)
self.builders = []
def register(self, treebuilder_class):
“””Register a treebuilder based on its advertised features.”””
for feature in treebuilder_class.features:
self.builders_for_feature[feature].insert(0, treebuilder_class)
self.builders.insert(0, treebuilder_class)
def lookup(self, *features):
if len(self.builders) == 0:
# There are no builders at all.
return None
if len(features) == 0:
# They didn’t ask for any features. Give them the most
# recently registered builder.
return self.builders[0]
# Go down the list of features in order, and eliminate any builders
# that don’t match every feature.
features = list(features)
features.reverse()
candidates = None
candidate_set = None
while len(features) > 0:
feature = features.pop()
we_have_the_feature = self.builders_for_feature.get(feature, [])
if len(we_have_the_feature) > 0:
if candidates is None:
candidates = we_have_the_feature
candidate_set = set(candidates)
else:
# Eliminate any candidates that don’t have this feature.
candidate_set = candidate_set.intersection(
set(we_have_the_feature))
# The only valid candidates are the ones in candidate_set.
# Go through the original list of candidates and pick the first one
# that’s in candidate_set.
if candidate_set is None:
return None
for candidate in candidates:
if candidate in candidate_set:
return candidate
return None
# The BeautifulSoup class will take feature lists from developers and use them
# to look up builders in this registry.
builder_registry = TreeBuilderRegistry()
class TreeBuilder(object):
“””Turn a document into a Beautiful Soup object tree.”””
features = []
is_xml = False
preserve_whitespace_tags = set()
empty_element_tags = None # A tag will be considered an empty-element
# tag when and only when it has no contents.
# A value for these tag/attribute combinations is a space- or
# comma-separated list of CDATA, rather than a single CDATA.
cdata_list_attributes = {}
def __init__(self):
self.soup = None
def reset(self):
pass
    def can_be_empty_element(self, tag_name):
        """Might a tag with this name be an empty-element tag?

        The final markup may or may not actually present this tag as
        self-closing.

        For instance: an HTMLBuilder does not consider a <p> tag to be
        an empty-element tag (it's not in
        HTMLBuilder.empty_element_tags). This means an empty <p> tag
        will be presented as "<p></p>", not "<p/>".

        The default implementation has no opinion about which tags are
        empty-element tags, so a tag will be presented as an
        empty-element tag if and only if it has no contents.
        "<foo></foo>" will become "<foo/>", and "<foo>bar</foo>" will
        be left alone.
        """
        if self.empty_element_tags is None:
            return True
        return tag_name in self.empty_element_tags
def feed(self, markup):
raise NotImplementedError()
def prepare_markup(self, markup, user_specified_encoding=None,
document_declared_encoding=None):
return markup, None, None, False
def test_fragment_to_document(self, fragment):
“””Wrap an HTML fragment to make it look like a document.
        Different parsers do this differently. For instance, lxml
        introduces an empty <head> tag, and html5lib
doesn’t. Abstracting this away lets us write simple tests
which run HTML fragments through the parser and compare the
results against other HTML fragments.
This method should not be used outside of tests.
“””
return fragment
def set_up_substitutions(self, tag):
return False
def _replace_cdata_list_attribute_values(self, tag_name, attrs):
“””Replaces class=”foo bar” with class=[“foo”, “bar”]
Modifies its input in place.
“””
if self.cdata_list_attributes:
universal = self.cdata_list_attributes.get(‘*’, [])
tag_specific = self.cdata_list_attributes.get(
tag_name.lower(), [])
for cdata_list_attr in itertools.chain(universal, tag_specific):
if cdata_list_attr in dict(attrs):
# Basically, we have a “class” attribute whose
# value is a whitespace-separated list of CSS
# classes. Split it into a list.
value = attrs[cdata_list_attr]
values = whitespace_re.split(value)
attrs[cdata_list_attr] = values
return attrs
class SAXTreeBuilder(TreeBuilder):
“””A Beautiful Soup treebuilder that listens for SAX events.”””
def feed(self, markup):
raise NotImplementedError()
def close(self):
pass
def startElement(self, name, attrs):
attrs = dict((key[1], value) for key, value in list(attrs.items()))
#print “Start %s, %r” % (name, attrs)
self.soup.handle_starttag(name, attrs)
def endElement(self, name):
#print “End %s” % name
self.soup.handle_endtag(name)
def startElementNS(self, nsTuple, nodeName, attrs):
# Throw away (ns, nodeName) for now.
self.startElement(nodeName, attrs)
def endElementNS(self, nsTuple, nodeName):
# Throw away (ns, nodeName) for now.
self.endElement(nodeName)
#handler.endElementNS((ns, node.nodeName), node.nodeName)
def startPrefixMapping(self, prefix, nodeValue):
# Ignore the prefix for now.
pass
def endPrefixMapping(self, prefix):
# Ignore the prefix for now.
# handler.endPrefixMapping(prefix)
pass
def characters(self, content):
self.soup.handle_data(content)
def startDocument(self):
pass
def endDocument(self):
pass
class HTMLTreeBuilder(TreeBuilder):
“””This TreeBuilder knows facts about HTML.
Such as which tags are empty-element tags.
“””
preserve_whitespace_tags = set([‘pre’, ‘textarea’])
empty_element_tags = set([‘br’ , ‘hr’, ‘input’, ‘img’, ‘meta’,
‘spacer’, ‘link’, ‘frame’, ‘base’])
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class=”foo bar” means that the ‘class’ attribute has two values,
# ‘foo’ and ‘bar’, not the single value ‘foo bar’. When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
    cdata_list_attributes = {
        "*" : ['class', 'accesskey', 'dropzone'],
        "a" : ['rel', 'rev'],
        "link" : ['rel', 'rev'],
        "td" : ["headers"],
        "th" : ["headers"],
        "form" : ["accept-charset"],
        "object" : ["archive"],

        # These are HTML5 specific, as are *.accesskey and *.dropzone above.
        "area" : ["rel"],
        "icon" : ["sizes"],
        "iframe" : ["sandbox"],
        "output" : ["for"],
        }
def set_up_substitutions(self, tag):
# We are only interested in tags
if tag.name != ‘meta’:
return False
http_equiv = tag.get(‘http-equiv’)
content = tag.get(‘content’)
charset = tag.get(‘charset’)
# We are interested in tags that say what encoding the
# document was originally in. This means HTML 5-style
# tags that provide the “charset” attribute. It also means
# HTML 4-style tags that provide the “content”
# attribute and have “http-equiv” set to “content-type”.
#
# In both cases we will replace the value of the appropriate
# attribute with a standin object that can take on any
# encoding.
        meta_encoding = None
        if charset is not None:
            # HTML 5 style:
            # <meta charset="utf8">
            meta_encoding = charset
            tag['charset'] = CharsetMetaAttributeValue(charset)

        elif (content is not None and http_equiv is not None
              and http_equiv.lower() == 'content-type'):
            # HTML 4 style:
            # <meta http-equiv="content-type" content="text/html; charset=utf8">
            tag['content'] = ContentMetaAttributeValue(content)
return (meta_encoding is not None)
def register_treebuilders_from(module):
“””Copy TreeBuilders from the given module into this module.”””
# I’m fairly sure this is not the best way to do this.
this_module = sys.modules[‘bs4.builder’]
for name in module.__all__:
obj = getattr(module, name)
if issubclass(obj, TreeBuilder):
setattr(this_module, name, obj)
this_module.__all__.append(name)
# Register the builder while we’re at it.
this_module.builder_registry.register(obj)
# Builders are registered in reverse order of priority, so that custom
# builder registrations will take precedence. In general, we want lxml
# to take precedence over html5lib, because it's faster. And we only
# want to use HTMLParser as a last resort.
from . import _htmlparser
register_treebuilders_from(_htmlparser)
try:
from . import _html5lib
register_treebuilders_from(_html5lib)
except ImportError:
# They don’t have html5lib installed.
pass
try:
from . import _lxml
register_treebuilders_from(_lxml)
except ImportError:
# They don’t have lxml installed.
pass
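# Aside (illustrative sketch of the registry above): lookup() intersects
# the builders registered for each requested feature and returns the most
# recently registered survivor, or None.
#
#     from bs4.builder import builder_registry
#     builder_registry.lookup('html')              # best available HTML builder
#     builder_registry.lookup('html', 'fast')      # e.g. LXMLTreeBuilder if lxml is present
#     builder_registry.lookup('no-such-feature')   # None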
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/dammit.py
# -*- coding: utf-8 -*-
“””Beautiful Soup bonus library: Unicode, Dammit
This class forces XML data into a standard format (usually to UTF-8 or
Unicode). It is heavily based on code from Mark Pilgrim’s Universal
Feed Parser. It does not rewrite the XML or HTML to reflect a new
encoding; that’s the tree builder’s job.
“””
import codecs
from html.entities import codepoint2name
import re
import warnings
# Autodetects character encodings. Very useful.
# Download from http://chardet.feedparser.org/
# or ‘apt-get install python-chardet’
# or ‘easy_install chardet’
try:
import chardet
#import chardet.constants
#chardet.constants._debug = 1
except ImportError:
chardet = None
# Available from http://cjkpython.i18n.org/.
try:
import iconv_codec
except ImportError:
pass
xml_encoding_re = re.compile(
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode(), re.I)
html_meta_re = re.compile(
    '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)
class EntitySubstitution(object):
“””Substitute XML or HTML entities for the corresponding characters.”””
def _populate_class_variables():
lookup = {}
reverse_lookup = {}
characters_for_re = []
for codepoint, name in list(codepoint2name.items()):
character = chr(codepoint)
            if codepoint != 34:
                # There's no point in turning the quotation mark into
                # &quot;, unless it happens within an attribute value, which
                # is handled elsewhere.
                characters_for_re.append(character)
                lookup[character] = name
            # But we do want to turn &quot; into the quotation mark.
            reverse_lookup[name] = character
re_definition = “[%s]” % “”.join(characters_for_re)
return lookup, reverse_lookup, re.compile(re_definition)
(CHARACTER_TO_HTML_ENTITY, HTML_ENTITY_TO_CHARACTER,
CHARACTER_TO_HTML_ENTITY_RE) = _populate_class_variables()
    CHARACTER_TO_XML_ENTITY = {
        "'": "apos",
        '"': "quot",
        "&": "amp",
        "<": "lt",
        ">": "gt",
        }

    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
                                           "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
                                           ")")
@classmethod
def _substitute_html_entity(cls, matchobj):
entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0))
return “&%s;” % entity
@classmethod
def _substitute_xml_entity(cls, matchobj):
“””Used with a regular expression to substitute the
appropriate XML entity for an XML special character.”””
entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
return “&%s;” % entity
    @classmethod
    def quoted_attribute_value(self, value):
        """Make a value into a quoted XML attribute, possibly escaping it.

        Most strings will be quoted using double quotes.

        Bob's Bar -> "Bob's Bar"

        If a string contains double quotes, it will be quoted using
        single quotes.

        Welcome to "my bar" -> 'Welcome to "my bar"'

        If a string contains both single and double quotes, the
        double quotes will be escaped, and the string will be quoted
        using double quotes.

        Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's bar&quot;"
        """
        quote_with = '"'
        if '"' in value:
            if "'" in value:
                # The string contains both single and double
                # quotes. Turn the double quotes into
                # entities. We quote the double quotes rather than
                # the single quotes because the entity name is
                # "&quot;" whether this is HTML or XML. If we
                # quoted the single quotes, we'd have to decide
                # between &apos; and &squot;.
                replace_with = "&quot;"
                value = value.replace('"', replace_with)
            else:
                # There are double quotes but no single quotes.
                # We can use single quotes to quote the attribute.
                quote_with = "'"
        return quote_with + value + quote_with
    @classmethod
    def substitute_xml(cls, value, make_quoted_attribute=False):
        """Substitute XML entities for special XML characters.

        :param value: A string to be substituted. The less-than sign will
          become &lt;, the greater-than sign will become &gt;, and any
          ampersands that are not part of an entity definition will
          become &amp;.

        :param make_quoted_attribute: If True, then the string will be
          quoted, as befits an attribute value.
        """
        # Escape angle brackets, and ampersands that aren't part of
        # entities.
        value = cls.BARE_AMPERSAND_OR_BRACKET.sub(
            cls._substitute_xml_entity, value)
        if make_quoted_attribute:
            value = cls.quoted_attribute_value(value)
        return value
    @classmethod
    def substitute_html(cls, s):
        """Replace certain Unicode characters with named HTML entities.

        This differs from data.encode(encoding, 'xmlcharrefreplace')
        in that the goal is to make the result more readable (to those
        with ASCII displays) rather than to recover from
        errors. There's absolutely nothing wrong with a UTF-8 string
        containing a LATIN SMALL LETTER E WITH ACUTE, but replacing that
        character with "&eacute;" will make it more readable to some
        people.
        """
        return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(
            cls._substitute_html_entity, s)
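# Aside (illustrative sketch of the EntitySubstitution helpers above):
#
#     from bs4.dammit import EntitySubstitution
#     EntitySubstitution.substitute_xml('a < b & c')
#     # 'a &lt; b &amp; c'
#     EntitySubstitution.substitute_xml('say "hi"', make_quoted_attribute=True)
#     # returns the single-quoted string 'say "hi"', since the value
#     # contains double quotes
#     EntitySubstitution.substitute_html(u'caf\N{LATIN SMALL LETTER E WITH ACUTE}')
#     # u'caf&eacute;'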
class UnicodeDammit:
“””A class for detecting the encoding of a *ML document and
converting it to a Unicode string. If the source encoding is
windows-1252, can replace MS smart quotes with their HTML or XML
equivalents.”””
# This dictionary maps commonly seen values for “charset” in HTML
# meta tags to the corresponding Python codec names. It only covers
# values that aren’t in Python’s aliases and can’t be determined
# by the heuristics in find_codec.
CHARSET_ALIASES = {“macintosh”: “mac-roman”,
“x-sjis”: “shift-jis”}
ENCODINGS_WITH_SMART_QUOTES = [
“windows-1252”,
“iso-8859-1”,
“iso-8859-2″,
]
def __init__(self, markup, override_encodings=[],
smart_quotes_to=None, is_html=False):
self.declared_html_encoding = None
self.smart_quotes_to = smart_quotes_to
self.tried_encodings = []
self.contains_replacement_characters = False
        if markup == '' or isinstance(markup, str):
self.markup = markup
self.unicode_markup = str(markup)
self.original_encoding = None
return
new_markup, document_encoding, sniffed_encoding = \
self._detectEncoding(markup, is_html)
self.markup = new_markup
u = None
if new_markup != markup:
# _detectEncoding modified the markup, then converted it to
# Unicode and then to UTF-8. So convert it from UTF-8.
u = self._convert_from(“utf8”)
self.original_encoding = sniffed_encoding
if not u:
for proposed_encoding in (
override_encodings + [document_encoding, sniffed_encoding]):
if proposed_encoding is not None:
u = self._convert_from(proposed_encoding)
if u:
break
# If no luck and we have auto-detection library, try that:
if not u and chardet and not isinstance(self.markup, str):
u = self._convert_from(chardet.detect(self.markup)[‘encoding’])
# As a last resort, try utf-8 and windows-1252:
if not u:
for proposed_encoding in (“utf-8”, “windows-1252”):
u = self._convert_from(proposed_encoding)
if u:
break
# As an absolute last resort, try the encodings again with
# character replacement.
if not u:
for proposed_encoding in (
override_encodings + [
document_encoding, sniffed_encoding, “utf-8”, “windows-1252”]):
if proposed_encoding != “ascii”:
u = self._convert_from(proposed_encoding, “replace”)
if u is not None:
warnings.warn(
UnicodeWarning(
“Some characters could not be decoded, and were ”
“replaced with REPLACEMENT CHARACTER.”))
self.contains_replacement_characters = True
break
# We could at this point force it to ASCII, but that would
# destroy so much data that I think giving up is better
self.unicode_markup = u
if not u:
self.original_encoding = None
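A short usage sketch for the constructor logic above. The byte string and the list of candidate encodings are illustrative:
# UnicodeDammit usage sketch
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # latin-1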
    def _sub_ms_char(self, match):
        """Changes a MS smart quote character to an XML or HTML
        entity, or an ASCII character."""
        orig = match.group(1)
        if self.smart_quotes_to == 'ascii':
            sub = self.MS_CHARS_TO_ASCII.get(orig).encode()
        else:
            sub = self.MS_CHARS.get(orig)
            if type(sub) == tuple:
                if self.smart_quotes_to == 'xml':
                    sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
                else:
                    sub = '&'.encode() + sub[0].encode() + ';'.encode()
            else:
                sub = sub.encode()
        return sub
    def _convert_from(self, proposed, errors="strict"):
        proposed = self.find_codec(proposed)
        if not proposed or (proposed, errors) in self.tried_encodings:
            return None
        self.tried_encodings.append((proposed, errors))
        markup = self.markup
        # Convert smart quotes to HTML if coming from an encoding
        # that might have them.
        if (self.smart_quotes_to is not None
            and proposed.lower() in self.ENCODINGS_WITH_SMART_QUOTES):
            smart_quotes_re = b"([\x80-\x9f])"
            smart_quotes_compiled = re.compile(smart_quotes_re)
            markup = smart_quotes_compiled.sub(self._sub_ms_char, markup)
        try:
            #print "Trying to convert document to %s (errors=%s)" % (
            #    proposed, errors)
            u = self._to_unicode(markup, proposed, errors)
            self.markup = u
            self.original_encoding = proposed
        except Exception as e:
            #print "That didn't work!"
            #print e
            return None
        #print "Correct encoding: %s" % proposed
        return self.markup
    def _to_unicode(self, data, encoding, errors="strict"):
        '''Given a string and its encoding, decodes the string into Unicode.
        %encoding is a string recognized by encodings.aliases'''
        # strip Byte Order Mark (if present); `data` is a bytestring
        # here, so the comparisons must be against bytes literals
        if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
               and (data[2:4] != b'\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
                 and (data[2:4] != b'\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == b'\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == b'\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == b'\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        newdata = str(data, encoding, errors)
        return newdata
    def _detectEncoding(self, xml_data, is_html=False):
        """Given a document, tries to detect its XML encoding."""
        xml_encoding = sniffed_xml_encoding = None
        try:
            if xml_data[:4] == b'\x4c\x6f\xa7\x94':
                # EBCDIC
                xml_data = self._ebcdic_to_ascii(xml_data)
            elif xml_data[:4] == b'\x00\x3c\x00\x3f':
                # UTF-16BE
                sniffed_xml_encoding = 'utf-16be'
                xml_data = str(xml_data, 'utf-16be').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xfe\xff') \
                     and (xml_data[2:4] != b'\x00\x00'):
                # UTF-16BE with BOM
                sniffed_xml_encoding = 'utf-16be'
                xml_data = str(xml_data[2:], 'utf-16be').encode('utf-8')
            elif xml_data[:4] == b'\x3c\x00\x3f\x00':
                # UTF-16LE
                sniffed_xml_encoding = 'utf-16le'
                xml_data = str(xml_data, 'utf-16le').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == b'\xff\xfe') and \
                     (xml_data[2:4] != b'\x00\x00'):
                # UTF-16LE with BOM
                sniffed_xml_encoding = 'utf-16le'
                xml_data = str(xml_data[2:], 'utf-16le').encode('utf-8')
            elif xml_data[:4] == b'\x00\x00\x00\x3c':
                # UTF-32BE
                sniffed_xml_encoding = 'utf-32be'
                xml_data = str(xml_data, 'utf-32be').encode('utf-8')
            elif xml_data[:4] == b'\x3c\x00\x00\x00':
                # UTF-32LE
                sniffed_xml_encoding = 'utf-32le'
                xml_data = str(xml_data, 'utf-32le').encode('utf-8')
            elif xml_data[:4] == b'\x00\x00\xfe\xff':
                # UTF-32BE with BOM
                sniffed_xml_encoding = 'utf-32be'
                xml_data = str(xml_data[4:], 'utf-32be').encode('utf-8')
            elif xml_data[:4] == b'\xff\xfe\x00\x00':
                # UTF-32LE with BOM
                sniffed_xml_encoding = 'utf-32le'
                xml_data = str(xml_data[4:], 'utf-32le').encode('utf-8')
            elif xml_data[:3] == b'\xef\xbb\xbf':
                # UTF-8 with BOM
                sniffed_xml_encoding = 'utf-8'
                xml_data = str(xml_data[3:], 'utf-8').encode('utf-8')
            else:
                sniffed_xml_encoding = 'ascii'
                pass
        except:
            xml_encoding_match = None
        xml_encoding_match = xml_encoding_re.match(xml_data)
        if not xml_encoding_match and is_html:
            xml_encoding_match = html_meta_re.search(xml_data)
        if xml_encoding_match is not None:
            xml_encoding = xml_encoding_match.groups()[0].decode(
                'ascii').lower()
            if is_html:
                self.declared_html_encoding = xml_encoding
            if sniffed_xml_encoding and \
               (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
                                 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
                                 'utf-16', 'utf-32', 'utf_16', 'utf_32',
                                 'utf16', 'u16')):
                xml_encoding = sniffed_xml_encoding
        return xml_data, xml_encoding, sniffed_xml_encoding
    def find_codec(self, charset):
        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
               or (charset and self._codec(charset.replace("-", ""))) \
               or (charset and self._codec(charset.replace("-", "_"))) \
               or charset
    def _codec(self, charset):
        if not charset:
            return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec
    EBCDIC_TO_ASCII_MAP = None
    def _ebcdic_to_ascii(self, s):
        c = self.__class__
        if not c.EBCDIC_TO_ASCII_MAP:
            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
                    250,251,252,253,254,255)
            # string.maketrans is gone in Python 3; build a bytes
            # translation table instead, since the EBCDIC input is a
            # bytestring.
            c.EBCDIC_TO_ASCII_MAP = bytes.maketrans(
                bytes(range(256)), bytes(emap))
        return s.translate(c.EBCDIC_TO_ASCII_MAP)
    # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
    MS_CHARS = {b'\x80': ('euro', '20AC'),
                b'\x81': ' ',
                b'\x82': ('sbquo', '201A'),
                b'\x83': ('fnof', '192'),
                b'\x84': ('bdquo', '201E'),
                b'\x85': ('hellip', '2026'),
                b'\x86': ('dagger', '2020'),
                b'\x87': ('Dagger', '2021'),
                b'\x88': ('circ', '2C6'),
                b'\x89': ('permil', '2030'),
                b'\x8A': ('Scaron', '160'),
                b'\x8B': ('lsaquo', '2039'),
                b'\x8C': ('OElig', '152'),
                b'\x8D': '?',
                b'\x8E': ('#x17D', '17D'),
                b'\x8F': '?',
                b'\x90': '?',
                b'\x91': ('lsquo', '2018'),
                b'\x92': ('rsquo', '2019'),
                b'\x93': ('ldquo', '201C'),
                b'\x94': ('rdquo', '201D'),
                b'\x95': ('bull', '2022'),
                b'\x96': ('ndash', '2013'),
                b'\x97': ('mdash', '2014'),
                b'\x98': ('tilde', '2DC'),
                b'\x99': ('trade', '2122'),
                b'\x9a': ('scaron', '161'),
                b'\x9b': ('rsaquo', '203A'),
                b'\x9c': ('oelig', '153'),
                b'\x9d': '?',
                b'\x9e': ('#x17E', '17E'),
                b'\x9f': ('Yuml', ''),}
    # A parochial partial mapping of ISO-Latin-1 to ASCII. Contains
    # horrors like stripping diacritical marks to turn á into a, but also
    # contains non-horrors like turning “ into ".
    MS_CHARS_TO_ASCII = {
        b'\x80' : 'EUR',  b'\x81' : ' ',    b'\x82' : ',',    b'\x83' : 'f',
        b'\x84' : ',,',   b'\x85' : '...',  b'\x86' : '+',    b'\x87' : '++',
        b'\x88' : '^',    b'\x89' : '%',    b'\x8a' : 'S',    b'\x8b' : '<',
        b'\x8c' : 'OE',   b'\x8d' : '?',    b'\x8e' : 'Z',    b'\x8f' : '?',
        b'\x90' : '?',    b'\x91' : "'",    b'\x92' : "'",    b'\x93' : '"',
        b'\x94' : '"',    b'\x95' : '*',    b'\x96' : '-',    b'\x97' : '--',
        b'\x98' : '~',    b'\x99' : '(TM)', b'\x9a' : 's',    b'\x9b' : '>',
        b'\x9c' : 'oe',   b'\x9d' : '?',    b'\x9e' : 'z',    b'\x9f' : 'Y',
        b'\xa0' : ' ',    b'\xa1' : '!',    b'\xa2' : 'c',    b'\xa3' : 'GBP',
        b'\xa4' : '$',    # especially parochial--this is the generic currency symbol
        b'\xa5' : 'YEN',  b'\xa6' : '|',    b'\xa7' : 'S',    b'\xa8' : '..',
        b'\xa9' : '',     b'\xaa' : '(th)', b'\xab' : '<<',   b'\xac' : '!',
        b'\xad' : ' ',    b'\xae' : '(R)',  b'\xaf' : '-',    b'\xb0' : 'o',
        b'\xb1' : '+-',   b'\xb2' : '2',    b'\xb3' : '3',    b'\xb4' : ("'", 'acute'),
        b'\xb5' : 'u',    b'\xb6' : 'P',    b'\xb7' : '*',    b'\xb8' : ',',
        b'\xb9' : '1',    b'\xba' : '(th)', b'\xbb' : '>>',   b'\xbc' : '1/4',
        b'\xbd' : '1/2',  b'\xbe' : '3/4',  b'\xbf' : '?',
        b'\xc0' : 'A',    b'\xc1' : 'A',    b'\xc2' : 'A',    b'\xc3' : 'A',
        b'\xc4' : 'A',    b'\xc5' : 'A',    b'\xc6' : 'AE',   b'\xc7' : 'C',
        b'\xc8' : 'E',    b'\xc9' : 'E',    b'\xca' : 'E',    b'\xcb' : 'E',
        b'\xcc' : 'I',    b'\xcd' : 'I',    b'\xce' : 'I',    b'\xcf' : 'I',
        b'\xd0' : 'D',    b'\xd1' : 'N',    b'\xd2' : 'O',    b'\xd3' : 'O',
        b'\xd4' : 'O',    b'\xd5' : 'O',    b'\xd6' : 'O',    b'\xd7' : '*',
        b'\xd8' : 'O',    b'\xd9' : 'U',    b'\xda' : 'U',    b'\xdb' : 'U',
        b'\xdc' : 'U',    b'\xdd' : 'Y',    b'\xde' : 'b',    b'\xdf' : 'B',
        b'\xe0' : 'a',    b'\xe1' : 'a',    b'\xe2' : 'a',    b'\xe3' : 'a',
        b'\xe4' : 'a',    b'\xe5' : 'a',    b'\xe6' : 'ae',   b'\xe7' : 'c',
        b'\xe8' : 'e',    b'\xe9' : 'e',    b'\xea' : 'e',    b'\xeb' : 'e',
        b'\xec' : 'i',    b'\xed' : 'i',    b'\xee' : 'i',    b'\xef' : 'i',
        b'\xf0' : 'o',    b'\xf1' : 'n',    b'\xf2' : 'o',    b'\xf3' : 'o',
        b'\xf4' : 'o',    b'\xf5' : 'o',    b'\xf6' : 'o',    b'\xf7' : '/',
        b'\xf8' : 'o',    b'\xf9' : 'u',    b'\xfa' : 'u',    b'\xfb' : 'u',
        b'\xfc' : 'u',    b'\xfd' : 'y',    b'\xfe' : 'b',    b'\xff' : 'y',
        }
    # A map used when removing rogue Windows-1252/ISO-8859-1
    # characters in otherwise UTF-8 documents. Each value is the UTF-8
    # encoding of the Unicode character the Windows-1252 byte stands for.
    #
    # Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in
    # Windows-1252.
    WINDOWS_1252_TO_UTF8 = {
        0x80 : b'\xe2\x82\xac', 0x82 : b'\xe2\x80\x9a', 0x83 : b'\xc6\x92',
        0x84 : b'\xe2\x80\x9e', 0x85 : b'\xe2\x80\xa6', 0x86 : b'\xe2\x80\xa0',
        0x87 : b'\xe2\x80\xa1', 0x88 : b'\xcb\x86',     0x89 : b'\xe2\x80\xb0',
        0x8a : b'\xc5\xa0',     0x8b : b'\xe2\x80\xb9', 0x8c : b'\xc5\x92',
        0x8e : b'\xc5\xbd',     0x91 : b'\xe2\x80\x98', 0x92 : b'\xe2\x80\x99',
        0x93 : b'\xe2\x80\x9c', 0x94 : b'\xe2\x80\x9d', 0x95 : b'\xe2\x80\xa2',
        0x96 : b'\xe2\x80\x93', 0x97 : b'\xe2\x80\x94', 0x98 : b'\xcb\x9c',
        0x99 : b'\xe2\x84\xa2', 0x9a : b'\xc5\xa1',     0x9b : b'\xe2\x80\xba',
        0x9c : b'\xc5\x93',     0x9e : b'\xc5\xbe',     0x9f : b'\xc5\xb8',
        0xa0 : b'\xc2\xa0', 0xa1 : b'\xc2\xa1', 0xa2 : b'\xc2\xa2', 0xa3 : b'\xc2\xa3',
        0xa4 : b'\xc2\xa4', 0xa5 : b'\xc2\xa5', 0xa6 : b'\xc2\xa6', 0xa7 : b'\xc2\xa7',
        0xa8 : b'\xc2\xa8', 0xa9 : b'\xc2\xa9', 0xaa : b'\xc2\xaa', 0xab : b'\xc2\xab',
        0xac : b'\xc2\xac', 0xad : b'\xc2\xad', 0xae : b'\xc2\xae', 0xaf : b'\xc2\xaf',
        0xb0 : b'\xc2\xb0', 0xb1 : b'\xc2\xb1', 0xb2 : b'\xc2\xb2', 0xb3 : b'\xc2\xb3',
        0xb4 : b'\xc2\xb4', 0xb5 : b'\xc2\xb5', 0xb6 : b'\xc2\xb6', 0xb7 : b'\xc2\xb7',
        0xb8 : b'\xc2\xb8', 0xb9 : b'\xc2\xb9', 0xba : b'\xc2\xba', 0xbb : b'\xc2\xbb',
        0xbc : b'\xc2\xbc', 0xbd : b'\xc2\xbd', 0xbe : b'\xc2\xbe', 0xbf : b'\xc2\xbf',
        0xc0 : b'\xc3\x80', 0xc1 : b'\xc3\x81', 0xc2 : b'\xc3\x82', 0xc3 : b'\xc3\x83',
        0xc4 : b'\xc3\x84', 0xc5 : b'\xc3\x85', 0xc6 : b'\xc3\x86', 0xc7 : b'\xc3\x87',
        0xc8 : b'\xc3\x88', 0xc9 : b'\xc3\x89', 0xca : b'\xc3\x8a', 0xcb : b'\xc3\x8b',
        0xcc : b'\xc3\x8c', 0xcd : b'\xc3\x8d', 0xce : b'\xc3\x8e', 0xcf : b'\xc3\x8f',
        0xd0 : b'\xc3\x90', 0xd1 : b'\xc3\x91', 0xd2 : b'\xc3\x92', 0xd3 : b'\xc3\x93',
        0xd4 : b'\xc3\x94', 0xd5 : b'\xc3\x95', 0xd6 : b'\xc3\x96', 0xd7 : b'\xc3\x97',
        0xd8 : b'\xc3\x98', 0xd9 : b'\xc3\x99', 0xda : b'\xc3\x9a', 0xdb : b'\xc3\x9b',
        0xdc : b'\xc3\x9c', 0xdd : b'\xc3\x9d', 0xde : b'\xc3\x9e', 0xdf : b'\xc3\x9f',
        0xe0 : b'\xc3\xa0', 0xe1 : b'\xc3\xa1', 0xe2 : b'\xc3\xa2', 0xe3 : b'\xc3\xa3',
        0xe4 : b'\xc3\xa4', 0xe5 : b'\xc3\xa5', 0xe6 : b'\xc3\xa6', 0xe7 : b'\xc3\xa7',
        0xe8 : b'\xc3\xa8', 0xe9 : b'\xc3\xa9', 0xea : b'\xc3\xaa', 0xeb : b'\xc3\xab',
        0xec : b'\xc3\xac', 0xed : b'\xc3\xad', 0xee : b'\xc3\xae', 0xef : b'\xc3\xaf',
        0xf0 : b'\xc3\xb0', 0xf1 : b'\xc3\xb1', 0xf2 : b'\xc3\xb2', 0xf3 : b'\xc3\xb3',
        0xf4 : b'\xc3\xb4', 0xf5 : b'\xc3\xb5', 0xf6 : b'\xc3\xb6', 0xf7 : b'\xc3\xb7',
        0xf8 : b'\xc3\xb8', 0xf9 : b'\xc3\xb9', 0xfa : b'\xc3\xba', 0xfb : b'\xc3\xbb',
        0xfc : b'\xc3\xbc', 0xfd : b'\xc3\xbd', 0xfe : b'\xc3\xbe',
        }
    MULTIBYTE_MARKERS_AND_SIZES = [
        (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF
        (0xe0, 0xef, 3), # 3-byte characters start with E0-EF
        (0xf0, 0xf4, 4), # 4-byte characters start with F0-F4
    ]
    FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]
    LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
    @classmethod
    def detwingle(cls, in_bytes, main_encoding="utf8",
                  embedded_encoding="windows-1252"):
        """Fix characters from one encoding embedded in some other encoding.
        Currently the only situation supported is Windows-1252 (or its
        subset ISO-8859-1), embedded in UTF-8.
        The input must be a bytestring. If you've already converted
        the document to Unicode, you're too late.
        The output is a bytestring in which `embedded_encoding`
        characters have been converted to their `main_encoding`
        equivalents.
        """
        if embedded_encoding.replace('_', '-').lower() not in (
            'windows-1252', 'windows_1252'):
            raise NotImplementedError(
                "Windows-1252 and ISO-8859-1 are the only currently supported "
                "embedded encodings.")
        if main_encoding.lower() not in ('utf8', 'utf-8'):
            raise NotImplementedError(
                "UTF-8 is the only currently supported main encoding.")
        byte_chunks = []
        chunk_start = 0
        pos = 0
        while pos < len(in_bytes):
            byte = in_bytes[pos]
            if not isinstance(byte, int):
                # Python 2.x
                byte = ord(byte)
            if (byte >= cls.FIRST_MULTIBYTE_MARKER
                and byte <= cls.LAST_MULTIBYTE_MARKER):
                # This is the start of a UTF-8 multibyte character. Skip
                # to the end.
                for start, end, size in cls.MULTIBYTE_MARKERS_AND_SIZES:
                    if byte >= start and byte <= end:
                        pos += size
                        break
            elif byte >= 0x80 and byte in cls.WINDOWS_1252_TO_UTF8:
                # We found a Windows-1252 character!
                # Save the string up to this point as a chunk.
                byte_chunks.append(in_bytes[chunk_start:pos])
                # Now translate the Windows-1252 character into UTF-8
                # and add it as another, one-byte chunk.
                byte_chunks.append(cls.WINDOWS_1252_TO_UTF8[byte])
                pos += 1
                chunk_start = pos
            else:
                # Go on to the next character.
                pos += 1
        if chunk_start == 0:
            # The string is unchanged.
            return in_bytes
        else:
            # Store the final chunk.
            byte_chunks.append(in_bytes[chunk_start:])
        return b''.join(byte_chunks)
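A sketch of detwingle() on a deliberately mixed bytestring: the loop above skips over well-formed UTF-8 multibyte sequences and converts stray Windows-1252 bytes in place. The input here is illustrative:
# detwingle usage sketch
snowmen = ('\N{SNOWMAN}' * 3).encode('utf8')
quote = b'\x93Quoted\x94'   # Windows-1252 smart quotes
doc = snowmen + quote       # an inconsistently encoded bytestring
print(UnicodeDammit.detwingle(doc).decode('utf8'))  # ☃☃☃“Quoted”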
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/element.py
import collections
import re
import sys
import warnings
from bs4.dammit import EntitySubstitution
DEFAULT_OUTPUT_ENCODING = "utf-8"
PY3K = (sys.version_info[0] > 2)
whitespace_re = re.compile(r"\s+")
def _alias(attr):
    """Alias one attribute name to another for backward compatibility"""
    @property
    def alias(self):
        return getattr(self, attr)
    @alias.setter
    def alias(self, value):
        # the setter must accept the assigned value as well as self
        return setattr(self, attr, value)
    return alias
class NamespacedAttribute(str):
    def __new__(cls, prefix, name, namespace=None):
        if name is None:
            obj = str.__new__(cls, prefix)
        else:
            obj = str.__new__(cls, prefix + ":" + name)
        obj.prefix = prefix
        obj.name = name
        obj.namespace = namespace
        return obj
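A tiny sketch of what this string subclass stores (values illustrative):
# NamespacedAttribute sketch
attr = NamespacedAttribute("xlink", "href", "http://www.w3.org/1999/xlink")
print(attr)            # xlink:href
print(attr.namespace)  # http://www.w3.org/1999/xlink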
class AttributeValueWithCharsetSubstitution(str):
    """A stand-in object for a character encoding specified in HTML."""
class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
    """A generic stand-in for the value of a meta tag's 'charset' attribute.
    When Beautiful Soup parses the markup '<meta charset="utf8">', the
    value of the 'charset' attribute will be one of these objects.
    """
    def __new__(cls, original_value):
        obj = str.__new__(cls, original_value)
        obj.original_value = original_value
        return obj
    def encode(self, encoding):
        return encoding
class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
    """A generic stand-in for the value of a meta tag's 'content' attribute.
    When Beautiful Soup parses the markup:
    <meta http-equiv="content-type" content="text/html; charset=utf8">
    The value of the 'content' attribute will be one of these objects.
    """
    CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)
    def __new__(cls, original_value):
        match = cls.CHARSET_RE.search(original_value)
        if match is None:
            # No substitution necessary.
            return str.__new__(str, original_value)
        obj = str.__new__(cls, original_value)
        obj.original_value = original_value
        return obj
    def encode(self, encoding):
        def rewrite(match):
            return match.group(1) + encoding
        return self.CHARSET_RE.sub(rewrite, self.original_value)
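A minimal sketch of why these stand-ins exist: when a tag is re-encoded for output, the attribute value rewrites itself to name the new encoding. The value here is illustrative:
# charset stand-in sketch
value = ContentMetaAttributeValue('text/html; charset=utf8')
print(value)                   # text/html; charset=utf8
print(value.encode('euc-jp'))  # text/html; charset=euc-jp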
class PageElement(object):
    """Contains the navigational information for some part of the page
    (either a tag or a piece of text)"""
    # There are four possible values for the "formatter" argument passed in
    # to methods like encode() and prettify():
    #
    # "html" - All Unicode characters with corresponding HTML entities
    #          are converted to those entities on output.
    # "minimal" - Bare ampersands and angle brackets are converted to
    #             XML entities: &amp; &lt; &gt;
    # None - The null formatter. Unicode characters are never
    #        converted to entities. This is not recommended, but it's
    #        faster than "minimal".
    # A function - This function will be called on every string that
    #              needs to undergo entity substitution.
    FORMATTERS = {
        "html" : EntitySubstitution.substitute_html,
        "minimal" : EntitySubstitution.substitute_xml,
        None : None
        }
    @classmethod
    def format_string(cls, s, formatter='minimal'):
        """Format the given string using the given formatter."""
        if not callable(formatter):
            formatter = cls.FORMATTERS.get(
                formatter, EntitySubstitution.substitute_xml)
        if formatter is None:
            output = s
        else:
            output = formatter(s)
        return output
    def setup(self, parent=None, previous_element=None):
        """Sets up the initial relations between this element and
        other elements."""
        self.parent = parent
        self.previous_element = previous_element
        if previous_element is not None:
            self.previous_element.next_element = self
        self.next_element = None
        self.previous_sibling = None
        self.next_sibling = None
        if self.parent is not None and self.parent.contents:
            self.previous_sibling = self.parent.contents[-1]
            self.previous_sibling.next_sibling = self
    nextSibling = _alias("next_sibling") # BS3
    previousSibling = _alias("previous_sibling") # BS3
    def replace_with(self, replace_with):
        if replace_with is self:
            return
        if replace_with is self.parent:
            raise ValueError("Cannot replace a Tag with its parent.")
        old_parent = self.parent
        my_index = self.parent.index(self)
        self.extract()
        old_parent.insert(my_index, replace_with)
        return self
    replaceWith = replace_with # BS3
    def unwrap(self):
        my_parent = self.parent
        my_index = self.parent.index(self)
        self.extract()
        for child in reversed(self.contents[:]):
            my_parent.insert(my_index, child)
        return self
    replace_with_children = unwrap
    replaceWithChildren = unwrap # BS3
    def wrap(self, wrap_inside):
        me = self.replace_with(wrap_inside)
        wrap_inside.append(me)
        return wrap_inside
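A tree-surgery sketch using the three methods above. The markup is illustrative; BeautifulSoup and new_tag come from the main bs4 package:
# replace_with / wrap / unwrap sketch
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>I wish I was <i>bold</i>.</p>")
new_tag = soup.new_tag("b")
new_tag.string = "bold"
soup.i.replace_with(new_tag)      # the <i> becomes a <b>
soup.p.wrap(soup.new_tag("div"))  # the paragraph gains a <div> parent
soup.div.p.unwrap()               # the <p>'s children move up, <p> is removed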
    def extract(self):
        """Destructively rips this element out of the tree."""
        if self.parent is not None:
            del self.parent.contents[self.parent.index(self)]
        # Find the two elements that would be next to each other if
        # this element (and any children) hadn't been parsed. Connect
        # the two.
        last_child = self._last_descendant()
        next_element = last_child.next_element
        if self.previous_element is not None:
            self.previous_element.next_element = next_element
        if next_element is not None:
            next_element.previous_element = self.previous_element
        self.previous_element = None
        last_child.next_element = None
        self.parent = None
        if self.previous_sibling is not None:
            self.previous_sibling.next_sibling = self.next_sibling
        if self.next_sibling is not None:
            self.next_sibling.previous_sibling = self.previous_sibling
        self.previous_sibling = self.next_sibling = None
        return self
    def _last_descendant(self):
        "Finds the last element beneath this object to be parsed."
        last_child = self
        while hasattr(last_child, 'contents') and last_child.contents:
            last_child = last_child.contents[-1]
        return last_child
    # BS3: Not part of the API!
    _lastRecursiveChild = _last_descendant
    def insert(self, position, new_child):
        if new_child is self:
            raise ValueError("Cannot insert a tag into itself.")
        if (isinstance(new_child, str)
            and not isinstance(new_child, NavigableString)):
            new_child = NavigableString(new_child)
        position = min(position, len(self.contents))
        if hasattr(new_child, 'parent') and new_child.parent is not None:
            # We're 'inserting' an element that's already one
            # of this object's children.
            if new_child.parent is self:
                current_index = self.index(new_child)
                if current_index < position:
                    # We're moving this element further down the list
                    # of this object's children. That means that when
                    # we extract this element, our target index will
                    # jump down one.
                    position -= 1
            new_child.extract()
        new_child.parent = self
        previous_child = None
        if position == 0:
            new_child.previous_sibling = None
            new_child.previous_element = self
        else:
            previous_child = self.contents[position - 1]
            new_child.previous_sibling = previous_child
            new_child.previous_sibling.next_sibling = new_child
            new_child.previous_element = previous_child._last_descendant()
        if new_child.previous_element is not None:
            new_child.previous_element.next_element = new_child
        new_childs_last_element = new_child._last_descendant()
        if position >= len(self.contents):
            new_child.next_sibling = None
            parent = self
            parents_next_sibling = None
            while parents_next_sibling is None and parent is not None:
                parents_next_sibling = parent.next_sibling
                parent = parent.parent
                if parents_next_sibling is not None:
                    # We found the element that comes next in the document.
                    break
            if parents_next_sibling is not None:
                new_childs_last_element.next_element = parents_next_sibling
            else:
                # The last element of this tag is the last element in
                # the document.
                new_childs_last_element.next_element = None
        else:
            next_child = self.contents[position]
            new_child.next_sibling = next_child
            if new_child.next_sibling is not None:
                new_child.next_sibling.previous_sibling = new_child
            new_childs_last_element.next_element = next_child
        if new_childs_last_element.next_element is not None:
            new_childs_last_element.next_element.previous_element = new_childs_last_element
        self.contents.insert(position, new_child)
    def append(self, tag):
        """Appends the given tag to the contents of this tag."""
        self.insert(len(self.contents), tag)
    def insert_before(self, predecessor):
        """Makes the given element the immediate predecessor of this one.
        The two elements will have the same parent, and the given element
        will be immediately before this one.
        """
        if self is predecessor:
            raise ValueError("Can't insert an element before itself.")
        parent = self.parent
        if parent is None:
            raise ValueError(
                "Element has no parent, so 'before' has no meaning.")
        # Extract first so that the index won't be screwed up if they
        # are siblings.
        if isinstance(predecessor, PageElement):
            predecessor.extract()
        index = parent.index(self)
        parent.insert(index, predecessor)
    def insert_after(self, successor):
        """Makes the given element the immediate successor of this one.
        The two elements will have the same parent, and the given element
        will be immediately after this one.
        """
        if self is successor:
            raise ValueError("Can't insert an element after itself.")
        parent = self.parent
        if parent is None:
            raise ValueError(
                "Element has no parent, so 'after' has no meaning.")
        # Extract first so that the index won't be screwed up if they
        # are siblings.
        if isinstance(successor, PageElement):
            successor.extract()
        index = parent.index(self)
        parent.insert(index+1, successor)
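A sketch of the two insertion helpers (markup illustrative; new_string is the BeautifulSoup factory for bare NavigableStrings):
# insert_before / insert_after sketch
soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)                  # <b><i>Don't</i>stop</b>
soup.b.i.insert_after(soup.new_string(" ever "))  # <b><i>Don't</i> ever stop</b>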
    def find_next(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears after this Tag in the document."""
        return self._find_one(self.find_all_next, name, attrs, text, **kwargs)
    findNext = find_next # BS3
    def find_all_next(self, name=None, attrs={}, text=None, limit=None,
                      **kwargs):
        """Returns all items that match the given criteria and appear
        after this Tag in the document."""
        return self._find_all(name, attrs, text, limit, self.next_elements,
                              **kwargs)
    findAllNext = find_all_next # BS3
    def find_next_sibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears after this Tag in the document."""
        return self._find_one(self.find_next_siblings, name, attrs, text,
                              **kwargs)
    findNextSibling = find_next_sibling # BS3
    def find_next_siblings(self, name=None, attrs={}, text=None, limit=None,
                           **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear after this Tag in the document."""
        return self._find_all(name, attrs, text, limit,
                              self.next_siblings, **kwargs)
    findNextSiblings = find_next_siblings # BS3
    fetchNextSiblings = find_next_siblings # BS2
    def find_previous(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears before this Tag in the document."""
        return self._find_one(
            self.find_all_previous, name, attrs, text, **kwargs)
    findPrevious = find_previous # BS3
    def find_all_previous(self, name=None, attrs={}, text=None, limit=None,
                          **kwargs):
        """Returns all items that match the given criteria and appear
        before this Tag in the document."""
        return self._find_all(name, attrs, text, limit, self.previous_elements,
                              **kwargs)
    findAllPrevious = find_all_previous # BS3
    fetchPrevious = find_all_previous # BS2
    def find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears before this Tag in the document."""
        return self._find_one(self.find_previous_siblings, name, attrs, text,
                              **kwargs)
    findPreviousSibling = find_previous_sibling # BS3
    def find_previous_siblings(self, name=None, attrs={}, text=None,
                               limit=None, **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear before this Tag in the document."""
        return self._find_all(name, attrs, text, limit,
                              self.previous_siblings, **kwargs)
    findPreviousSiblings = find_previous_siblings # BS3
    fetchPreviousSiblings = find_previous_siblings # BS2
    def find_parent(self, name=None, attrs={}, **kwargs):
        """Returns the closest parent of this Tag that matches the given
        criteria."""
        # NOTE: We can't use _find_one because findParents takes a different
        # set of arguments.
        r = None
        l = self.find_parents(name, attrs, 1)
        if l:
            r = l[0]
        return r
    findParent = find_parent # BS3
    def find_parents(self, name=None, attrs={}, limit=None, **kwargs):
        """Returns the parents of this Tag that match the given
        criteria."""
        return self._find_all(name, attrs, None, limit, self.parents,
                              **kwargs)
    findParents = find_parents # BS3
    fetchParents = find_parents # BS2
    @property
    def next(self):
        return self.next_element
    @property
    def previous(self):
        return self.previous_element
    #These methods do the real heavy lifting.
    def _find_one(self, method, name, attrs, text, **kwargs):
        r = None
        l = method(name, attrs, text, 1, **kwargs)
        if l:
            r = l[0]
        return r
    def _find_all(self, name, attrs, text, limit, generator, **kwargs):
        "Iterates over a generator looking for things that match."
        if isinstance(name, SoupStrainer):
            strainer = name
        elif text is None and not limit and not attrs and not kwargs:
            # Optimization to find all tags.
            if name is True or name is None:
                return [element for element in generator
                        if isinstance(element, Tag)]
            # Optimization to find all tags with a given name.
            elif isinstance(name, str):
                return [element for element in generator
                        if isinstance(element, Tag) and element.name == name]
            else:
                strainer = SoupStrainer(name, attrs, text, **kwargs)
        else:
            # Build a SoupStrainer
            strainer = SoupStrainer(name, attrs, text, **kwargs)
        results = ResultSet(strainer)
        while True:
            try:
                i = next(generator)
            except StopIteration:
                break
            if i:
                found = strainer.search(i)
                if found:
                    results.append(found)
                    if limit and len(results) >= limit:
                        break
        return results
    #These generators can be used to navigate starting from both
    #NavigableStrings and Tags.
    @property
    def next_elements(self):
        i = self.next_element
        while i is not None:
            yield i
            i = i.next_element
    @property
    def next_siblings(self):
        i = self.next_sibling
        while i is not None:
            yield i
            i = i.next_sibling
    @property
    def previous_elements(self):
        i = self.previous_element
        while i is not None:
            yield i
            i = i.previous_element
    @property
    def previous_siblings(self):
        i = self.previous_sibling
        while i is not None:
            yield i
            i = i.previous_sibling
    @property
    def parents(self):
        i = self.parent
        while i is not None:
            yield i
            i = i.parent
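The properties above are lazy generators that walk the parse tree in document order. A quick sketch of what they yield (document and names illustrative):
# navigation generator sketch
soup = BeautifulSoup("<html><body><p>One</p><p>Two</p></body></html>")
first_p = soup.p
print([parent.name for parent in first_p.parents])  # ['body', 'html', '[document]']
print([sib for sib in first_p.next_siblings])       # [<p>Two</p>]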
    # Methods for supporting CSS selectors.
    tag_name_re = re.compile('^[a-z0-9]+$')
    # /^(\w+)\[(\w+)([=~\|\^\$\*]?)=?"?([^\]"]*)"?\]$/
    #   \---/  \---/\-------------/    \-------/
    #     |      |         |               |
    #     |      |         |           The value
    #     |      |    ~,|,^,$,* or =
    #     |   Attribute
    #    Tag
    attribselect_re = re.compile(
        r'^(?P<tag>\w+)?\[(?P<attribute>\w+)(?P<operator>[=~\|\^\$\*]?)' +
        r'=?"?(?P<value>[^\]"]*)"?\]$'
        )
    def _attr_value_as_string(self, value, default=None):
        """Force an attribute value into a string representation.
        A multi-valued attribute will be converted into a
        space-separated string.
        """
        value = self.get(value, default)
        if isinstance(value, list) or isinstance(value, tuple):
            value = " ".join(value)
        return value
    def _attribute_checker(self, operator, attribute, value=''):
        """Create a function that performs a CSS selector operation.
        Takes an operator, attribute and optional value. Returns a
        function that will return True for elements that match that
        combination.
        """
        if operator == '=':
            # string representation of `attribute` is equal to `value`
            return lambda el: el._attr_value_as_string(attribute) == value
        elif operator == '~':
            # space-separated list representation of `attribute`
            # contains `value`
            def _includes_value(element):
                attribute_value = element.get(attribute, [])
                if not isinstance(attribute_value, list):
                    attribute_value = attribute_value.split()
                return value in attribute_value
            return _includes_value
        elif operator == '^':
            # string representation of `attribute` starts with `value`
            return lambda el: el._attr_value_as_string(
                attribute, '').startswith(value)
        elif operator == '$':
            # string representation of `attribute` ends with `value`
            return lambda el: el._attr_value_as_string(
                attribute, '').endswith(value)
        elif operator == '*':
            # string representation of `attribute` contains `value`
            return lambda el: value in el._attr_value_as_string(attribute, '')
        elif operator == '|':
            # string representation of `attribute` is either exactly
            # `value` or starts with `value` and then a dash.
            def _is_or_starts_with_dash(element):
                attribute_value = element._attr_value_as_string(attribute, '')
                return (attribute_value == value or attribute_value.startswith(
                    value + '-'))
            return _is_or_starts_with_dash
        else:
            return lambda el: el.has_attr(attribute)
    def select(self, selector):
        """Perform a CSS selection operation on the current element."""
        tokens = selector.split()
        current_context = [self]
        for index, token in enumerate(tokens):
            if tokens[index - 1] == '>':
                # already found direct descendants in last step. skip this
                # step.
                continue
            m = self.attribselect_re.match(token)
            if m is not None:
                # Attribute selector
                tag, attribute, operator, value = m.groups()
                if not tag:
                    tag = True
                checker = self._attribute_checker(operator, attribute, value)
                found = []
                for context in current_context:
                    found.extend(
                        [el for el in context.find_all(tag) if checker(el)])
                current_context = found
                continue
            if '#' in token:
                # ID selector
                tag, id = token.split('#', 1)
                if tag == "":
                    tag = True
                el = current_context[0].find(tag, {'id': id})
                if el is None:
                    return [] # No match
                current_context = [el]
                continue
            if '.' in token:
                # Class selector
                tag_name, klass = token.split('.', 1)
                if not tag_name:
                    tag_name = True
                classes = set(klass.split('.'))
                found = []
                def classes_match(tag):
                    if tag_name is not True and tag.name != tag_name:
                        return False
                    if not tag.has_attr('class'):
                        return False
                    return classes.issubset(tag['class'])
                for context in current_context:
                    found.extend(context.find_all(classes_match))
                current_context = found
                continue
            if token == '*':
                # Star selector
                found = []
                for context in current_context:
                    found.extend(context.findAll(True))
                current_context = found
                continue
            if token == '>':
                # Child selector
                tag = tokens[index + 1]
                if not tag:
                    tag = True
                found = []
                for context in current_context:
                    found.extend(context.find_all(tag, recursive=False))
                current_context = found
                continue
            # Here we should just have a regular tag
            if not self.tag_name_re.match(token):
                return []
            found = []
            for context in current_context:
                found.extend(context.findAll(token))
            current_context = found
        return current_context
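A sketch of the selector grammar this method supports, one example per token type handled above (markup illustrative):
# select() sketch
soup = BeautifulSoup('<div id="main"><p class="story big">Told you so.</p></div>')
print(soup.select("#main"))             # ID selector
print(soup.select("p.story"))           # tag plus class selector
print(soup.select('p[class~=story]'))   # attribute selector
print(soup.select("div > p"))           # child selector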
    # Old non-property versions of the generators, for backwards
    # compatibility with BS3.
    def nextGenerator(self):
        return self.next_elements
    def nextSiblingGenerator(self):
        return self.next_siblings
    def previousGenerator(self):
        return self.previous_elements
    def previousSiblingGenerator(self):
        return self.previous_siblings
    def parentGenerator(self):
        return self.parents
class NavigableString(str, PageElement):
    PREFIX = ''
    SUFFIX = ''
    def __new__(cls, value):
        """Create a new NavigableString.
        When unpickling a NavigableString, this method is called with
        the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
        passed in to the superclass's __new__ or the superclass won't know
        how to handle non-ASCII characters.
        """
        if isinstance(value, str):
            return str.__new__(cls, value)
        return str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
    def __getnewargs__(self):
        return (str(self),)
    def __getattr__(self, attr):
        """text.string gives you text. This is for backwards
        compatibility for Navigable*String, but for CData* it lets you
        get the string without the CData wrapper."""
        if attr == 'string':
            return self
        else:
            raise AttributeError(
                "'%s' object has no attribute '%s'" % (
                    self.__class__.__name__, attr))
    def output_ready(self, formatter="minimal"):
        output = self.format_string(self, formatter)
        return self.PREFIX + output + self.SUFFIX
class PreformattedString(NavigableString):
    """A NavigableString not subject to the normal formatting rules.
    The string will be passed into the formatter (to trigger side effects),
    but the return value will be ignored.
    """
    def output_ready(self, formatter="minimal"):
        """CData strings are passed into the formatter.
        But the return value is ignored."""
        self.format_string(self, formatter)
        return self.PREFIX + self + self.SUFFIX
class CData(PreformattedString):
    PREFIX = '<![CDATA['
    SUFFIX = ']]>'
class ProcessingInstruction(PreformattedString):
    PREFIX = '<?'
    SUFFIX = '?>'
class Comment(PreformattedString):
    PREFIX = '<!--'
    SUFFIX = '-->'
class Declaration(PreformattedString):
    PREFIX = '<!'
    SUFFIX = '>'
class Doctype(PreformattedString):
    @classmethod
    def for_name_and_ids(cls, name, pub_id, system_id):
        value = name
        if pub_id is not None:
            value += ' PUBLIC "%s"' % pub_id
            if system_id is not None:
                value += ' "%s"' % system_id
        elif system_id is not None:
            value += ' SYSTEM "%s"' % system_id
        return Doctype(value)
    PREFIX = '<!DOCTYPE '
    SUFFIX = '>\n'
class Tag(PageElement):
    """Represents a found HTML tag with its attributes and contents."""
    def __init__(self, parser=None, builder=None, name=None, namespace=None,
                 prefix=None, attrs=None, parent=None, previous=None):
        "Basic constructor."
        if parser is None:
            self.parser_class = None
        else:
            # We don't actually store the parser object: that lets extracted
            # chunks be garbage-collected.
            self.parser_class = parser.__class__
        if name is None:
            raise ValueError("No value provided for new tag's name.")
        self.name = name
        self.namespace = namespace
        self.prefix = prefix
        if attrs is None:
            attrs = {}
        elif builder.cdata_list_attributes:
            attrs = builder._replace_cdata_list_attribute_values(
                self.name, attrs)
        else:
            attrs = dict(attrs)
        self.attrs = attrs
        self.contents = []
        self.setup(parent, previous)
        self.hidden = False
        # Set up any substitutions, such as the charset in a META tag.
        if builder is not None:
            builder.set_up_substitutions(self)
            self.can_be_empty_element = builder.can_be_empty_element(name)
        else:
            self.can_be_empty_element = False
    parserClass = _alias("parser_class") # BS3
    @property
    def is_empty_element(self):
        """Is this tag an empty-element tag? (aka a self-closing tag)
        A tag that has contents is never an empty-element tag.
        A tag that has no contents may or may not be an empty-element
        tag. It depends on the builder used to create the tag. If the
        builder has a designated list of empty-element tags, then only
        a tag whose name shows up in that list is considered an
        empty-element tag.
        If the builder has no designated list of empty-element tags,
        then any tag with no contents is an empty-element tag.
        """
        return len(self.contents) == 0 and self.can_be_empty_element
    isSelfClosing = is_empty_element # BS3
    @property
    def string(self):
        """Convenience property to get the single string within this tag.
        :Return: If this tag has a single string child, return value
        is that string. If this tag has no children, or more than one
        child, return value is None. If this tag has one child tag,
        return value is the 'string' attribute of the child tag,
        recursively.
        """
        if len(self.contents) != 1:
            return None
        child = self.contents[0]
        if isinstance(child, NavigableString):
            return child
        return child.string
    @string.setter
    def string(self, string):
        self.clear()
        self.append(string.__class__(string))
    def _all_strings(self, strip=False):
        """Yield all child strings, possibly stripping them."""
        for descendant in self.descendants:
            if not isinstance(descendant, NavigableString):
                continue
            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
    strings = property(_all_strings)
    @property
    def stripped_strings(self):
        for string in self._all_strings(True):
            yield string
    def get_text(self, separator="", strip=False):
        """
        Get all child strings, concatenated using the given separator.
        """
        return separator.join([s for s in self._all_strings(strip)])
    getText = get_text
    text = property(get_text)
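A sketch of the string-extraction helpers defined above (markup illustrative):
# get_text / stripped_strings sketch
soup = BeautifulSoup("<p> One </p><p>Two</p>")
print(soup.get_text("|", strip=True))  # One|Two
print(list(soup.stripped_strings))     # ['One', 'Two']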
    def decompose(self):
        """Recursively destroys the contents of this tree."""
        self.extract()
        i = self
        while i is not None:
            next = i.next_element
            i.__dict__.clear()
            i = next
    def clear(self, decompose=False):
        """
        Extract all children. If decompose is True, decompose instead.
        """
        if decompose:
            for element in self.contents[:]:
                if isinstance(element, Tag):
                    element.decompose()
                else:
                    element.extract()
        else:
            for element in self.contents[:]:
                element.extract()
    def index(self, element):
        """
        Find the index of a child by identity, not value. Avoids issues with
        tag.contents.index(element) getting the index of equal elements.
        """
        for i, child in enumerate(self.contents):
            if child is element:
                return i
        raise ValueError("Tag.index: element not in tag")
    def get(self, key, default=None):
        """Returns the value of the 'key' attribute for the tag, or
        the value given for 'default' if it doesn't have that
        attribute."""
        return self.attrs.get(key, default)
    def has_attr(self, key):
        return key in self.attrs
    def __hash__(self):
        return str(self).__hash__()
    def __getitem__(self, key):
        """tag[key] returns the value of the 'key' attribute for the tag,
        and throws an exception if it's not there."""
        return self.attrs[key]
    def __iter__(self):
        "Iterating over a tag iterates over its contents."
        return iter(self.contents)
    def __len__(self):
        "The length of a tag is the length of its list of contents."
        return len(self.contents)
    def __contains__(self, x):
        return x in self.contents
    def __bool__(self):
        "A tag is non-None even if it has no contents."
        return True
    def __setitem__(self, key, value):
        """Setting tag[key] sets the value of the 'key' attribute for the
        tag."""
        self.attrs[key] = value
    def __delitem__(self, key):
        "Deleting tag[key] deletes all 'key' attributes for the tag."
        self.attrs.pop(key, None)
    def __call__(self, *args, **kwargs):
        """Calling a tag like a function is the same as calling its
        find_all() method. Eg. tag('a') returns a list of all the A tags
        found within this tag."""
        return self.find_all(*args, **kwargs)
    def __getattr__(self, tag):
        #print "Getattr %s.%s" % (self.__class__, tag)
        if len(tag) > 3 and tag.endswith('Tag'):
            # BS3: soup.aTag -> soup.find("a")
            tag_name = tag[:-3]
            warnings.warn(
                '.%sTag is deprecated, use .find("%s") instead.' % (
                    tag_name, tag_name))
            return self.find(tag_name)
        # We special case contents to avoid recursion.
        elif not tag.startswith("__") and not tag == "contents":
            return self.find(tag)
        raise AttributeError(
            "'%s' object has no attribute '%s'" % (self.__class__, tag))
    def __eq__(self, other):
        """Returns true iff this tag has the same name, the same attributes,
        and the same contents (recursively) as the given tag."""
        if self is other:
            return True
        if (not hasattr(other, 'name') or
            not hasattr(other, 'attrs') or
            not hasattr(other, 'contents') or
            self.name != other.name or
            self.attrs != other.attrs or
            len(self) != len(other)):
            return False
        for i, my_child in enumerate(self.contents):
            if my_child != other.contents[i]:
                return False
        return True
    def __ne__(self, other):
        """Returns true iff this tag is not identical to the other tag,
        as defined in __eq__."""
        return not self == other
    def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        """Renders this tag as a string."""
        return self.encode(encoding)
    def __unicode__(self):
        return self.decode()
    def __str__(self):
        return self.encode()
    if PY3K:
        __str__ = __repr__ = __unicode__
    def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
               indent_level=None, formatter="minimal",
               errors="xmlcharrefreplace"):
        # Turn the data structure into Unicode, then encode the
        # Unicode.
        u = self.decode(indent_level, encoding, formatter)
        return u.encode(encoding, errors)
    def decode(self, indent_level=None,
               eventual_encoding=DEFAULT_OUTPUT_ENCODING,
               formatter="minimal"):
        """Returns a Unicode representation of this tag and its contents.
        :param eventual_encoding: The tag is destined to be
        encoded into this encoding. This method is _not_
        responsible for performing that encoding. This information
        is passed in so that it can be substituted in if the
        document contains a <META> tag that mentions the document's
        encoding.
        """
        attrs = []
        if self.attrs:
            for key, val in sorted(self.attrs.items()):
                if val is None:
                    decoded = key
                else:
                    if isinstance(val, list) or isinstance(val, tuple):
                        val = ' '.join(val)
                    elif not isinstance(val, str):
                        val = str(val)
                    elif (
                        isinstance(val, AttributeValueWithCharsetSubstitution)
                        and eventual_encoding is not None):
                        val = val.encode(eventual_encoding)
                    text = self.format_string(val, formatter)
                    decoded = (
                        str(key) + '='
                        + EntitySubstitution.quoted_attribute_value(text))
                attrs.append(decoded)
        close = ''
        closeTag = ''
        if self.is_empty_element:
            close = '/'
        else:
            closeTag = '</%s>' % self.name
        prefix = ''
        if self.prefix:
            prefix = self.prefix + ":"
        pretty_print = (indent_level is not None)
        if pretty_print:
            space = (' ' * (indent_level - 1))
            indent_contents = indent_level + 1
        else:
            space = ''
            indent_contents = None
        contents = self.decode_contents(
            indent_contents, eventual_encoding, formatter)
        if self.hidden:
            # This is the 'document root' object.
            s = contents
        else:
            s = []
            attribute_string = ''
            if attrs:
                attribute_string = ' ' + ' '.join(attrs)
            if pretty_print:
                s.append(space)
            s.append('<%s%s%s%s>' % (
                prefix, self.name, attribute_string, close))
            if pretty_print:
                s.append("\n")
            s.append(contents)
            if pretty_print and contents and contents[-1] != "\n":
                s.append("\n")
            if pretty_print and closeTag:
                s.append(space)
            s.append(closeTag)
            if pretty_print and closeTag and self.next_sibling:
                s.append("\n")
            s = ''.join(s)
        return s
    def prettify(self, encoding=None, formatter="minimal"):
        if encoding is None:
            return self.decode(True, formatter=formatter)
        else:
            return self.encode(encoding, True, formatter=formatter)
    def decode_contents(self, indent_level=None,
                        eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                        formatter="minimal"):
        """Renders the contents of this tag as a Unicode string.
        :param eventual_encoding: The tag is destined to be
        encoded into this encoding. This method is _not_
        responsible for performing that encoding. This information
        is passed in so that it can be substituted in if the
        document contains a <META> tag that mentions the document's
        encoding.
        """
        pretty_print = (indent_level is not None)
        s = []
        for c in self:
            text = None
            if isinstance(c, NavigableString):
                text = c.output_ready(formatter)
            elif isinstance(c, Tag):
                s.append(c.decode(indent_level, eventual_encoding,
                                  formatter))
            if text and indent_level:
                text = text.strip()
            if text:
                if pretty_print:
                    s.append(" " * (indent_level - 1))
                s.append(text)
                if pretty_print:
                    s.append("\n")
        return ''.join(s)
    def encode_contents(
        self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
        formatter="minimal"):
        """Renders the contents of this tag as a bytestring."""
        contents = self.decode_contents(indent_level, encoding, formatter)
        return contents.encode(encoding)
    # Old method for BS3 compatibility
    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        if not prettyPrint:
            indentLevel = None
        return self.encode_contents(
            indent_level=indentLevel, encoding=encoding)
    #Soup methods
    def find(self, name=None, attrs={}, recursive=True, text=None,
             **kwargs):
        """Return only the first child of this Tag matching the given
        criteria."""
        r = None
        l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
        if l:
            r = l[0]
        return r
    findChild = find
    def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs):
        """Extracts a list of Tag objects that match the given
        criteria. You can specify the name of the Tag and any
        attributes you want the Tag to have.
        The value of a key-value pair in the 'attrs' map can be a
        string, a list of strings, a regular expression object, or a
        callable that takes a string and returns whether or not the
        string matches for some custom definition of 'matches'. The
        same is true of the tag name."""
        generator = self.descendants
        if not recursive:
            generator = self.children
        return self._find_all(name, attrs, text, limit, generator, **kwargs)
    findAll = find_all # BS3
    findChildren = find_all # BS2
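All four matcher types the docstring mentions, in one sketch (markup illustrative):
# find_all matcher sketch
import re
soup = BeautifulSoup('<a class="ext" href="http://example.org">x</a><b>y</b>')
soup.find_all('b')                               # by tag name
soup.find_all(re.compile('^b$'))                 # by regular expression
soup.find_all('a', attrs={'class': 'ext'})       # by attribute value
soup.find_all(lambda tag: tag.has_attr('href'))  # by callable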
    #Generator methods
    @property
    def children(self):
        # return iter() to make the purpose of the method clear
        return iter(self.contents) # XXX This seems to be untested.
    @property
    def descendants(self):
        if not len(self.contents):
            return
        stopNode = self._last_descendant().next_element
        current = self.contents[0]
        while current is not stopNode:
            yield current
            current = current.next_element
    # Old names for backwards compatibility
    def childGenerator(self):
        return self.children
    def recursiveChildGenerator(self):
        return self.descendants
    # This was kind of misleading because has_key() (attributes) was
    # different from __in__ (contents). has_key() is gone in Python 3,
    # anyway.
    has_key = has_attr
# Next, a couple classes to represent queries and their results.
class SoupStrainer(object):
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""
    def __init__(self, name=None, attrs={}, text=None, **kwargs):
        self.name = self._normalize_search_value(name)
        if not isinstance(attrs, dict):
            # Treat a non-dict value for attrs as a search for the 'class'
            # attribute.
            kwargs['class'] = attrs
            attrs = None
        if kwargs:
            if attrs:
                attrs = attrs.copy()
                attrs.update(kwargs)
            else:
                attrs = kwargs
        normalized_attrs = {}
        for key, value in list(attrs.items()):
            normalized_attrs[key] = self._normalize_search_value(value)
        self.attrs = normalized_attrs
        self.text = self._normalize_search_value(text)
    def _normalize_search_value(self, value):
        # Leave it alone if it's a Unicode string, a callable, a
        # regular expression, a boolean, or None.
        if (isinstance(value, str) or callable(value) or hasattr(value, 'match')
            or isinstance(value, bool) or value is None):
            return value
        # If it's a bytestring, convert it to Unicode, treating it as UTF-8.
        if isinstance(value, bytes):
            return value.decode("utf8")
        # If it's listlike, convert it into a list of strings.
        if hasattr(value, '__iter__'):
            new_value = []
            for v in value:
                if (hasattr(v, '__iter__') and not isinstance(v, bytes)
                    and not isinstance(v, str)):
                    # This is almost certainly the user's mistake. In the
                    # interests of avoiding infinite loops, we'll let
                    # it through as-is rather than doing a recursive call.
                    new_value.append(v)
                else:
                    new_value.append(self._normalize_search_value(v))
            return new_value
        # Otherwise, convert it into a Unicode string.
        # The unicode(str()) thing is so this will do the same thing on Python 2
        # and Python 3.
        return str(str(value))
    def __str__(self):
        if self.text:
            return self.text
        else:
            return "%s|%s" % (self.name, self.attrs)
    def search_tag(self, markup_name=None, markup_attrs={}):
        found = None
        markup = None
        if isinstance(markup_name, Tag):
            markup = markup_name
            markup_attrs = markup
        call_function_with_tag_data = (
            isinstance(self.name, collections.Callable)
            and not isinstance(markup_name, Tag))
        if ((not self.name)
            or call_function_with_tag_data
            or (markup and self._matches(markup, self.name))
            or (not markup and self._matches(markup_name, self.name))):
            if call_function_with_tag_data:
                match = self.name(markup_name, markup_attrs)
            else:
                match = True
                markup_attr_map = None
                for attr, match_against in list(self.attrs.items()):
                    if not markup_attr_map:
                        if hasattr(markup_attrs, 'get'):
                            markup_attr_map = markup_attrs
                        else:
                            markup_attr_map = {}
                            for k, v in markup_attrs:
                                markup_attr_map[k] = v
                    attr_value = markup_attr_map.get(attr)
                    if not self._matches(attr_value, match_against):
                        match = False
                        break
            if match:
                if markup:
                    found = markup
                else:
                    found = markup_name
        if found and self.text and not self._matches(found.string, self.text):
            found = None
        return found
    searchTag = search_tag
    def search(self, markup):
        # print 'looking for %s in %s' % (self, markup)
        found = None
        # If given a list of items, scan it for a text element that
        # matches.
        if hasattr(markup, '__iter__') and not isinstance(markup, (Tag, str)):
            for element in markup:
                if isinstance(element, NavigableString) \
                       and self.search(element):
                    found = element
                    break
        # If it's a Tag, make sure its name or attributes match.
        # Don't bother with Tags if we're searching for text.
        elif isinstance(markup, Tag):
            if not self.text or self.name or self.attrs:
                found = self.search_tag(markup)
        # If it's text, make sure the text matches.
        elif isinstance(markup, NavigableString) or \
                 isinstance(markup, str):
            if not self.name and not self.attrs and self._matches(markup, self.text):
                found = markup
        else:
            raise Exception(
                "I don't know how to match against a %s" % markup.__class__)
        return found
    def _matches(self, markup, match_against):
        # print u"Matching %s against %s" % (markup, match_against)
        result = False
        if isinstance(markup, list) or isinstance(markup, tuple):
            # This should only happen when searching a multi-valued attribute
            # like 'class'.
            if (isinstance(match_against, str)
                and ' ' in match_against):
                # A bit of a special case. If they try to match "foo
                # bar" on a multivalue attribute's value, only accept
                # the literal value "foo bar"
                #
                # XXX This is going to be pretty slow because we keep
                # splitting match_against. But it shouldn't come up
                # too often.
                return (whitespace_re.split(match_against) == markup)
            else:
                for item in markup:
                    if self._matches(item, match_against):
                        return True
                return False
        if match_against is True:
            # True matches any non-None value.
            return markup is not None
        if isinstance(match_against, collections.Callable):
            return match_against(markup)
        # Custom callables take the tag as an argument, but all
        # other ways of matching match the tag name as a string.
        if isinstance(markup, Tag):
            markup = markup.name
        # Ensure that `markup` is either a Unicode string, or None.
        markup = self._normalize_search_value(markup)
        if markup is None:
            # None matches None, False, an empty string, an empty list, and so on.
            return not match_against
        if isinstance(match_against, str):
            # Exact string match
            return markup == match_against
        if hasattr(match_against, 'match'):
            # Regexp match
            return match_against.search(markup)
        if hasattr(match_against, '__iter__'):
            # The markup must be an exact match against something
            # in the iterable.
            return markup in match_against
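A sketch tying SoupStrainer back to parsing: parse_only is the BeautifulSoup constructor argument that hands a strainer to the tree builder, so only matching elements are kept.
# SoupStrainer usage sketch
from bs4 import BeautifulSoup, SoupStrainer
only_b_tags = SoupStrainer("b")
soup = BeautifulSoup("A <b>bold</b> statement", parse_only=only_b_tags)
print(soup.decode())  # <b>bold</b>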
class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source):
        list.__init__([])
        self.source = source
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/testing.py
"""Helper classes for tests."""
import copy
import functools
import unittest
from unittest import TestCase
from bs4 import BeautifulSoup
from bs4.element import (
CharsetMetaAttributeValue,
Comment,
ContentMetaAttributeValue,
Doctype,
SoupStrainer,
)
from bs4.builder import HTMLParserTreeBuilder
default_builder = HTMLParserTreeBuilder
class SoupTest(unittest.TestCase):
    @property
    def default_builder(self):
        return default_builder()
    def soup(self, markup, **kwargs):
        """Build a Beautiful Soup object from markup."""
        builder = kwargs.pop('builder', self.default_builder)
        return BeautifulSoup(markup, builder=builder, **kwargs)
    def document_for(self, markup):
        """Turn an HTML fragment into a document.
        The details depend on the builder.
        """
        return self.default_builder.test_fragment_to_document(markup)
    def assertSoupEquals(self, to_parse, compare_parsed_to=None):
        builder = self.default_builder
        obj = BeautifulSoup(to_parse, builder=builder)
        if compare_parsed_to is None:
            compare_parsed_to = to_parse
        self.assertEqual(obj.decode(), self.document_for(compare_parsed_to))
class HTMLTreeBuilderSmokeTest(object):
    """A basic test of a treebuilder's competence.
    Any HTML treebuilder, present or future, should be able to pass
    these tests. With invalid markup, there's room for interpretation,
    and different parsers can handle it differently. But with the
    markup in these tests, there's not much room for interpretation.
    """
    def assertDoctypeHandled(self, doctype_fragment):
        """Assert that a given doctype string is handled correctly."""
        doctype_str, soup = self._document_with_doctype(doctype_fragment)
        # Make sure a Doctype object was created.
        doctype = soup.contents[0]
        self.assertEqual(doctype.__class__, Doctype)
        self.assertEqual(doctype, doctype_fragment)
        self.assertEqual(str(soup)[:len(doctype_str)], doctype_str)
        # Make sure that the doctype was correctly associated with the
        # parse tree and that the rest of the document parsed.
        self.assertEqual(soup.p.contents[0], 'foo')
    def _document_with_doctype(self, doctype_fragment):
        """Generate and parse a document with the given doctype."""
        doctype = '<!DOCTYPE %s>' % doctype_fragment
        markup = doctype + '\n<p>foo</p>'
        soup = self.soup(markup)
        return doctype, soup
    def test_normal_doctypes(self):
        """Make sure normal, everyday HTML doctypes are handled correctly."""
        self.assertDoctypeHandled("html")
        self.assertDoctypeHandled(
            'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"')
    def test_public_doctype_with_url(self):
        doctype = 'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"'
        self.assertDoctypeHandled(doctype)
    def test_system_doctype(self):
        self.assertDoctypeHandled('foo SYSTEM "http://www.example.com/"')
    def test_namespaced_system_doctype(self):
        # We can handle a namespaced doctype with a system ID.
        self.assertDoctypeHandled('xsl:stylesheet SYSTEM "htmlent.dtd"')
    def test_namespaced_public_doctype(self):
        # Test a namespaced doctype with a public id.
        self.assertDoctypeHandled('xsl:stylesheet PUBLIC "htmlent.dtd"')
    def test_real_xhtml_document(self):
        """A real XHTML document should come out more or less the same as it went in."""
        markup = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello.</title></head><body>Goodbye.</body></html>"""
        soup = self.soup(markup)
        self.assertEqual(
            soup.encode("utf-8").replace(b"\n", b""),
            markup.replace(b"\n", b""))
    def test_deepcopy(self):
        """Make sure you can copy the tree builder.
        This is important because the builder is part of a
        BeautifulSoup object, and we want to be able to copy that.
        """
        copy.deepcopy(self.default_builder)
    def test_p_tag_is_never_empty_element(self):
        """A <p> tag is never designated as an empty-element tag.
        Even if the markup shows it as an empty-element tag, it
        shouldn't be presented that way.
        """
        soup = self.soup("<p/>")
        self.assertFalse(soup.p.is_empty_element)
        self.assertEqual(str(soup.p), "<p></p>")
    def test_unclosed_tags_get_closed(self):
        """A tag that's not closed by the end of the document should be closed.
        This applies to all tags except empty-element tags.
        """
        self.assertSoupEquals("<p>", "<p></p>")
        self.assertSoupEquals("<b>", "<b></b>")
        self.assertSoupEquals("<br>", "<br/>")
    def test_br_is_always_empty_element_tag(self):
        """A <br> tag is designated as an empty-element tag.
        Some parsers treat <br></br> as one <br/> tag, some parsers as
        two tags, but it should always be an empty-element tag.
        """
        soup = self.soup("<br></br>")
        self.assertTrue(soup.br.is_empty_element)
        self.assertEqual(str(soup.br), "<br/>")
    def test_nested_formatting_elements(self):
        self.assertSoupEquals("<em><em></em></em>")
    def test_comment(self):
        # Comments are represented as Comment objects.
        markup = "<p>foo<!--foobar-->baz</p>"
        self.assertSoupEquals(markup)
        soup = self.soup(markup)
        comment = soup.find(text="foobar")
        self.assertEqual(comment.__class__, Comment)
    def test_preserved_whitespace_in_pre_and_textarea(self):
        """Whitespace must be preserved in <pre> and <textarea> tags."""
        self.assertSoupEquals("<pre>   </pre>")
        self.assertSoupEquals("<textarea> woo  </textarea>")
    def test_nested_inline_elements(self):
        """Inline elements can be nested indefinitely."""
        b_tag = "<b>Inside a B tag</b>"
        self.assertSoupEquals(b_tag)
        nested_b_tag = "<p>A <i>nested <b>tag</b></i></p>"
        self.assertSoupEquals(nested_b_tag)
        double_nested_b_tag = "<p>A <a>doubly <i>nested <b>tag</b></i></a></p>"
        self.assertSoupEquals(double_nested_b_tag)
    def test_nested_block_level_elements(self):
        """Block elements can be nested."""
        soup = self.soup('<blockquote><p><b>Foo</b></p></blockquote>')
        blockquote = soup.blockquote
        self.assertEqual(blockquote.p.b.string, 'Foo')
        self.assertEqual(blockquote.b.string, 'Foo')

    def test_correctly_nested_tables(self):
        """One table can go inside another one."""
        markup = ('<table id="1">'
                  '<tr>'
                  "<td>Here's another table:"
                  '<table id="2">'
                  '<tr><td>foo</td></tr>'
                  '</table></td>')

        self.assertSoupEquals(
            markup,
            '<table id="1"><tr><td>Here\'s another table:'
            '<table id="2"><tr><td>foo</td></tr></table>'
            '</td></tr></table>')

        self.assertSoupEquals(
            "<table><thead><tr><td>Foo</td></tr></thead>"
            "<tbody><tr><td>Bar</td></tr></tbody>"
            "<tfoot><tr><td>Baz</td></tr></tfoot></table>")
    def test_angle_brackets_in_attribute_values_are_escaped(self):
        self.assertSoupEquals('<a b="<a>"></a>', '<a b="&lt;a&gt;"></a>')

    def test_entities_in_attributes_converted_to_unicode(self):
        expect = '<p id="pi\N{LATIN SMALL LETTER N WITH TILDE}ata"></p>'
        self.assertSoupEquals('<p id="pi&#241;ata"></p>', expect)
        self.assertSoupEquals('<p id="pi&#xf1;ata"></p>', expect)
        self.assertSoupEquals('<p id="pi&ntilde;ata"></p>', expect)

    def test_entities_in_text_converted_to_unicode(self):
        expect = '<p>pi\N{LATIN SMALL LETTER N WITH TILDE}ata</p>'
        self.assertSoupEquals("<p>pi&#241;ata</p>", expect)
        self.assertSoupEquals("<p>pi&#xf1;ata</p>", expect)
        self.assertSoupEquals("<p>pi&ntilde;ata</p>", expect)

    def test_quot_entity_converted_to_quotation_mark(self):
        self.assertSoupEquals("<p>I said &quot;good day!&quot;</p>",
                              '<p>I said "good day!"</p>')

    def test_out_of_range_entity(self):
        expect = "\N{REPLACEMENT CHARACTER}"
        self.assertSoupEquals("&#10000000000000;", expect)
        self.assertSoupEquals("&#x10000000000000;", expect)
        self.assertSoupEquals("&#1000000000;", expect)
    def test_basic_namespaces(self):
        """Parsers don't need to *understand* namespaces, but at the
        very least they should not choke on namespaces or lose
        data."""
        markup = b'<html xmlns="http://www.w3.org/1999/xhtml" xmlns:mathml="http://www.w3.org/1998/Math/MathML" xmlns:svg="http://www.w3.org/2000/svg"><head></head><body><mathml:msqrt>4</mathml:msqrt><b svg:fill="red"></b></body></html>'
        soup = self.soup(markup)
        self.assertEqual(markup, soup.encode())
        html = soup.html
        self.assertEqual('http://www.w3.org/1999/xhtml', soup.html['xmlns'])
        self.assertEqual(
            'http://www.w3.org/1998/Math/MathML', soup.html['xmlns:mathml'])
        self.assertEqual(
            'http://www.w3.org/2000/svg', soup.html['xmlns:svg'])

    def test_multivalued_attribute_value_becomes_list(self):
        markup = b'<a class="foo bar">'
        soup = self.soup(markup)
        self.assertEqual(['foo', 'bar'], soup.a['class'])
    #
    # Generally speaking, tests below this point are more tests of
    # Beautiful Soup than tests of the tree builders. But parsers are
    # weird, so we run these tests separately for every tree builder
    # to detect any differences between them.
    #
    def test_soupstrainer(self):
        """Parsers should be able to work with SoupStrainers."""
        strainer = SoupStrainer("b")
        soup = self.soup("A <b>bold</b> <meta/> <i>statement</i>",
                         parse_only=strainer)
        self.assertEqual(soup.decode(), "<b>bold</b>")

    def test_single_quote_attribute_values_become_double_quotes(self):
        self.assertSoupEquals("<foo attr='bar'></foo>",
                              '<foo attr="bar"></foo>')

    def test_attribute_values_with_nested_quotes_are_left_alone(self):
        text = """<foo attr='bar "brawls" happen'>a</foo>"""
        self.assertSoupEquals(text)

    def test_attribute_values_with_double_nested_quotes_get_quoted(self):
        text = """<foo attr='bar "brawls" happen'>a</foo>"""
        soup = self.soup(text)
        soup.foo['attr'] = 'Brawls happen at "Bob\'s Bar"'
        self.assertSoupEquals(
            soup.foo.decode(),
            """<foo attr="Brawls happen at &quot;Bob\'s Bar&quot;">a</foo>""")

    def test_ampersand_in_attribute_value_gets_escaped(self):
        self.assertSoupEquals('<this is="really messed up & stuff"></this>',
                              '<this is="really messed up &amp; stuff"></this>')

        self.assertSoupEquals(
            '<a href="http://example.org?a=1&b=2;3">foo</a>',
            '<a href="http://example.org?a=1&amp;b=2;3">foo</a>')

    def test_escaped_ampersand_in_attribute_value_is_left_alone(self):
        self.assertSoupEquals('<a href="http://example.org?a=1&amp;b=2;3"></a>')
    def test_entities_in_strings_converted_during_parsing(self):
        # Both XML and HTML entities are converted to Unicode characters
        # during parsing.
        text = "<p>&lt;&lt;sacr&eacute;&#32;bleu!&gt;&gt;</p>"
        expected = "<p>&lt;&lt;sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</p>"
        self.assertSoupEquals(text, expected)

    def test_smart_quotes_converted_on_the_way_in(self):
        # Microsoft smart quotes are converted to Unicode characters during
        # parsing.
        quote = b"<p>\x91Foo\x92</p>"
        soup = self.soup(quote)
        self.assertEqual(
            soup.p.string,
            "\N{LEFT SINGLE QUOTATION MARK}Foo\N{RIGHT SINGLE QUOTATION MARK}")

    def test_non_breaking_spaces_converted_on_the_way_in(self):
        soup = self.soup("<a>&nbsp;&nbsp;</a>")
        self.assertEqual(soup.a.string, "\N{NO-BREAK SPACE}" * 2)

    def test_entities_converted_on_the_way_out(self):
        text = "<p>&lt;&lt;sacr&eacute;&#32;bleu!&gt;&gt;</p>"
        expected = "<p>&lt;&lt;sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</p>".encode("utf-8")
        soup = self.soup(text)
        self.assertEqual(soup.p.encode("utf-8"), expected)
    def test_real_iso_latin_document(self):
        # Smoke test of interrelated functionality, using an
        # easy-to-understand document.

        # Here it is in Unicode. Note that it claims to be in ISO-Latin-1.
        unicode_html = '<html><head><meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type"/></head><body><p>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</p></body></html>'

        # That's because we're going to encode it into ISO-Latin-1, and use
        # that to test.
        iso_latin_html = unicode_html.encode("iso-8859-1")

        # Parse the ISO-Latin-1 HTML.
        soup = self.soup(iso_latin_html)
        # Encode it to UTF-8.
        result = soup.encode("utf-8")

        # What do we expect the result to look like? Well, it would
        # look like unicode_html, except that the META tag would say
        # UTF-8 instead of ISO-Latin-1.
        expected = unicode_html.replace("ISO-Latin-1", "utf-8")

        # And, of course, it would be in UTF-8, not Unicode.
        expected = expected.encode("utf-8")

        # Ta-da!
        self.assertEqual(result, expected)
    def test_real_shift_jis_document(self):
        # Smoke test to make sure the parser can handle a document in
        # Shift-JIS encoding, without choking.
        shift_jis_html = (
            b'<html><head></head><body><pre>'
            b'\x82\xb1\x82\xea\x82\xcdShift-JIS\x82\xc5\x83R\x81[\x83f'
            b'\x83B\x83\x93\x83O\x82\xb3\x82\xea\x82\xbd\x93\xfa\x96{\x8c'
            b'\xea\x82\xcc\x83t\x83@\x83C\x83\x8b\x82\xc5\x82\xb7\x81B'
            b'</pre></body></html>')
        unicode_html = shift_jis_html.decode("shift-jis")
        soup = self.soup(unicode_html)

        # Make sure the parse tree is correctly encoded to various
        # encodings.
        self.assertEqual(soup.encode("utf-8"), unicode_html.encode("utf-8"))
        self.assertEqual(soup.encode("euc_jp"), unicode_html.encode("euc_jp"))

    def test_real_hebrew_document(self):
        # A real-world test to make sure we can convert ISO-8859-8 (a
        # Hebrew encoding) to UTF-8.
        hebrew_document = b'<html><head><title>Hebrew (ISO 8859-8) in Visual Directionality</title></head><body><h1>Hebrew (ISO 8859-8) in Visual Directionality</h1>\xed\xe5\xec\xf9</body></html>'
        soup = self.soup(
            hebrew_document, from_encoding="iso8859-8")
        self.assertEqual(soup.original_encoding, 'iso8859-8')
        self.assertEqual(
            soup.encode('utf-8'),
            hebrew_document.decode("iso8859-8").encode("utf-8"))
    def test_meta_tag_reflects_current_encoding(self):
        # Here's the <meta> tag saying that a document is
        # encoded in Shift-JIS.
        meta_tag = ('<meta content="text/html; charset=x-sjis" '
                    'http-equiv="Content-type"/>')

        # Here's a document incorporating that meta tag.
        shift_jis_html = (
            '<html><head>\n%s\n'
            '<meta http-equiv="Content-language" content="ja"/>'
            '</head><body>Shift-JIS markup goes here.') % meta_tag
        soup = self.soup(shift_jis_html)

        # Parse the document, and the charset is seemingly unaffected.
        parsed_meta = soup.find('meta', {'http-equiv': 'Content-type'})
        content = parsed_meta['content']
        self.assertEqual('text/html; charset=x-sjis', content)

        # But that value is actually a ContentMetaAttributeValue object.
        self.assertTrue(isinstance(content, ContentMetaAttributeValue))

        # And it will take on a value that reflects its current
        # encoding.
        self.assertEqual('text/html; charset=utf8', content.encode("utf8"))

        # For the rest of the story, see TestSubstitutions in
        # test_tree.py.

    def test_html5_style_meta_tag_reflects_current_encoding(self):
        # Here's the <meta> tag saying that a document is
        # encoded in Shift-JIS.
        meta_tag = ('<meta id="encoding" charset="x-sjis"/>')

        # Here's a document incorporating that meta tag.
        shift_jis_html = (
            '<html><head>\n%s\n'
            '<meta http-equiv="Content-language" content="ja"/>'
            '</head><body>Shift-JIS markup goes here.') % meta_tag
        soup = self.soup(shift_jis_html)

        # Parse the document, and the charset is seemingly unaffected.
        parsed_meta = soup.find('meta', id="encoding")
        charset = parsed_meta['charset']
        self.assertEqual('x-sjis', charset)

        # But that value is actually a CharsetMetaAttributeValue object.
        self.assertTrue(isinstance(charset, CharsetMetaAttributeValue))

        # And it will take on a value that reflects its current
        # encoding.
        self.assertEqual('utf8', charset.encode("utf8"))

    def test_tag_with_no_attributes_can_have_attributes_added(self):
        data = self.soup("<a>text</a>")
        data.a['foo'] = 'bar'
        self.assertEqual('<a foo="bar">text</a>', data.a.decode())
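# Illustrative sketch, not part of the original module: how a builder is
# wired into the smoke tests above outside the test harness, using the
# always-available stdlib-backed builder. The function name is an
# editorial invention.
def _demo_builder_wiring():
    from bs4 import BeautifulSoup
    from bs4.builder import HTMLParserTreeBuilder
    soup = BeautifulSoup('<a class="foo bar">x</a>',
                         builder=HTMLParserTreeBuilder())
    # Multi-valued attributes come back as lists, as asserted in
    # test_multivalued_attribute_value_becomes_list above.
    assert soup.a['class'] == ['foo', 'bar']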
class XMLTreeBuilderSmokeTest(object):

    def test_docstring_generated(self):
        soup = self.soup("<root/>")
        self.assertEqual(
            soup.encode(), b'<?xml version="1.0" encoding="utf-8"?>\n<root/>')

    def test_real_xhtml_document(self):
        """A real XHTML document should come out *exactly* the same as it went in."""
        markup = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello.</title></head>
<body>Goodbye.</body>
</html>"""
        soup = self.soup(markup)
        self.assertEqual(
            soup.encode("utf-8"), markup)

    def test_docstring_includes_correct_encoding(self):
        soup = self.soup("<root/>")
        self.assertEqual(
            soup.encode("latin1"),
            b'<?xml version="1.0" encoding="latin1"?>\n<root/>')

    def test_large_xml_document(self):
        """A large XML document should come out the same as it went in."""
        markup = (b'<?xml version="1.0" encoding="utf-8"?>\n<root>'
                  + b'0' * (2**12)
                  + b'</root>')
        soup = self.soup(markup)
        self.assertEqual(soup.encode("utf-8"), markup)

    def test_tags_are_empty_element_if_and_only_if_they_are_empty(self):
        self.assertSoupEquals("<p>", "<p/>")
        self.assertSoupEquals("<p>foo</p>")

    def test_namespaces_are_preserved(self):
        markup = '<root xmlns:a="http://example.com/" xmlns:b="http://example.net/"><a:foo>This tag is in the a namespace</a:foo><b:foo>This tag is in the b namespace</b:foo></root>'
        soup = self.soup(markup)
        root = soup.root
        self.assertEqual("http://example.com/", root['xmlns:a'])
        self.assertEqual("http://example.net/", root['xmlns:b'])
class HTML5TreeBuilderSmokeTest(HTMLTreeBuilderSmokeTest):
    """Smoke test for a tree builder that supports HTML5."""

    def test_real_xhtml_document(self):
        # Since XHTML is not HTML5, HTML5 parsers are not tested to handle
        # XHTML documents in any particular way.
        pass

    def test_html_tags_have_namespace(self):
        markup = "<a>"
        soup = self.soup(markup)
        self.assertEqual("http://www.w3.org/1999/xhtml", soup.a.namespace)

    def test_svg_tags_have_namespace(self):
        markup = '<svg><circle/></svg>'
        soup = self.soup(markup)
        namespace = "http://www.w3.org/2000/svg"
        self.assertEqual(namespace, soup.svg.namespace)
        self.assertEqual(namespace, soup.circle.namespace)

    def test_mathml_tags_have_namespace(self):
        markup = '<math><msqrt>5</msqrt></math>'
        soup = self.soup(markup)
        namespace = 'http://www.w3.org/1998/Math/MathML'
        self.assertEqual(namespace, soup.math.namespace)
        self.assertEqual(namespace, soup.msqrt.namespace)
def skipIf(condition, reason):
    def nothing(test, *args, **kwargs):
        return None

    def decorator(test_item):
        if condition:
            return nothing
        else:
            return test_item
    return decorator
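# Illustrative sketch, not part of the original module: skipIf replaces
# the decorated item with a no-op when the condition is true, and leaves
# it alone otherwise. The probe function is an editorial invention.
def _demo_skipif():
    def probe(self):
        return "ran"
    assert skipIf(True, "always skipped")(probe)(None) is None
    assert skipIf(False, "never skipped")(probe)(None) == "ran"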
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/test_builder_registry.py
"""Tests of the builder registry."""

import unittest

from bs4 import BeautifulSoup
from bs4.builder import (
    builder_registry as registry,
    HTMLParserTreeBuilder,
    TreeBuilderRegistry,
)

try:
    from bs4.builder import HTML5TreeBuilder
    HTML5LIB_PRESENT = True
except ImportError:
    HTML5LIB_PRESENT = False

try:
    from bs4.builder import (
        LXMLTreeBuilderForXML,
        LXMLTreeBuilder,
    )
    LXML_PRESENT = True
except ImportError:
    LXML_PRESENT = False
class BuiltInRegistryTest(unittest.TestCase):
    """Test the built-in registry with the default builders registered."""

    def test_combination(self):
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('fast', 'html'),
                             LXMLTreeBuilder)
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('permissive', 'xml'),
                             LXMLTreeBuilderForXML)
        self.assertEqual(registry.lookup('strict', 'html'),
                         HTMLParserTreeBuilder)
        if HTML5LIB_PRESENT:
            self.assertEqual(registry.lookup('html5lib', 'html'),
                             HTML5TreeBuilder)

    def test_lookup_by_markup_type(self):
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('html'), LXMLTreeBuilder)
            self.assertEqual(registry.lookup('xml'), LXMLTreeBuilderForXML)
        else:
            self.assertEqual(registry.lookup('xml'), None)
            if HTML5LIB_PRESENT:
                self.assertEqual(registry.lookup('html'), HTML5TreeBuilder)
            else:
                self.assertEqual(registry.lookup('html'), HTMLParserTreeBuilder)

    def test_named_library(self):
        if LXML_PRESENT:
            self.assertEqual(registry.lookup('lxml', 'xml'),
                             LXMLTreeBuilderForXML)
            self.assertEqual(registry.lookup('lxml', 'html'),
                             LXMLTreeBuilder)
        if HTML5LIB_PRESENT:
            self.assertEqual(registry.lookup('html5lib'),
                             HTML5TreeBuilder)

        self.assertEqual(registry.lookup('html.parser'),
                         HTMLParserTreeBuilder)

    def test_beautifulsoup_constructor_does_lookup(self):
        # You can pass in a string.
        BeautifulSoup("", features="html")
        # Or a list of strings.
        BeautifulSoup("", features=["html", "fast"])

        # You'll get an exception if BS can't find an appropriate
        # builder.
        self.assertRaises(ValueError, BeautifulSoup,
                          "", features="no-such-feature")
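# Illustrative sketch, not part of the original tests: the constructor
# lookup exercised above, using the always-available stdlib builder name.
def _demo_feature_lookup():
    from bs4 import BeautifulSoup
    soup = BeautifulSoup("<b>bold</b>", features="html.parser")
    assert soup.b.string == "bold"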
class RegistryTest(unittest.TestCase):
    """Test the TreeBuilderRegistry class in general."""

    def setUp(self):
        self.registry = TreeBuilderRegistry()

    def builder_for_features(self, *feature_list):
        cls = type('Builder_' + '_'.join(feature_list),
                   (object,), {'features': feature_list})

        self.registry.register(cls)
        return cls

    def test_register_with_no_features(self):
        builder = self.builder_for_features()

        # Since the builder advertises no features, you can't find it
        # by looking up features.
        self.assertEqual(self.registry.lookup('foo'), None)

        # But you can find it by doing a lookup with no features, if
        # this happens to be the only registered builder.
        self.assertEqual(self.registry.lookup(), builder)

    def test_register_with_features_makes_lookup_succeed(self):
        builder = self.builder_for_features('foo', 'bar')
        self.assertEqual(self.registry.lookup('foo'), builder)
        self.assertEqual(self.registry.lookup('bar'), builder)

    def test_lookup_fails_when_no_builder_implements_feature(self):
        builder = self.builder_for_features('foo', 'bar')
        self.assertEqual(self.registry.lookup('baz'), None)

    def test_lookup_gets_most_recent_registration_when_no_feature_specified(self):
        builder1 = self.builder_for_features('foo')
        builder2 = self.builder_for_features('bar')
        self.assertEqual(self.registry.lookup(), builder2)

    def test_lookup_fails_when_no_tree_builders_registered(self):
        self.assertEqual(self.registry.lookup(), None)

    def test_lookup_gets_most_recent_builder_supporting_all_features(self):
        has_one = self.builder_for_features('foo')
        has_the_other = self.builder_for_features('bar')
        has_both_early = self.builder_for_features('foo', 'bar', 'baz')
        has_both_late = self.builder_for_features('foo', 'bar', 'quux')
        lacks_one = self.builder_for_features('bar')
        has_the_other = self.builder_for_features('foo')

        # There are two builders featuring 'foo' and 'bar', but
        # the one that also features 'quux' was registered later.
        self.assertEqual(self.registry.lookup('foo', 'bar'),
                         has_both_late)

        # There is only one builder featuring 'foo', 'bar', and 'baz'.
        self.assertEqual(self.registry.lookup('foo', 'bar', 'baz'),
                         has_both_early)

    def test_lookup_fails_when_cannot_reconcile_requested_features(self):
        builder1 = self.builder_for_features('foo', 'bar')
        builder2 = self.builder_for_features('foo', 'baz')
        self.assertEqual(self.registry.lookup('bar', 'baz'), None)
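# Illustrative sketch, not part of the original tests: registering a toy
# builder class in a private registry, mirroring builder_for_features
# above. FastBuilder is an editorial invention.
def _demo_custom_registry():
    from bs4.builder import TreeBuilderRegistry
    registry = TreeBuilderRegistry()
    FastBuilder = type('FastBuilder', (object,), {'features': ['fast', 'html']})
    registry.register(FastBuilder)
    assert registry.lookup('fast') is FastBuilder
    assert registry.lookup('xml') is None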
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/test_docs.py
"Test harness for doctests."

# pylint: disable-msg=E0611,W0142

__metaclass__ = type
__all__ = [
    'additional_tests',
]

import atexit
import doctest
import os
#from pkg_resources import (
#    resource_filename, resource_exists, resource_listdir, cleanup_resources)
import unittest

DOCTEST_FLAGS = (
    doctest.ELLIPSIS |
    doctest.NORMALIZE_WHITESPACE |
    doctest.REPORT_NDIFF)


# def additional_tests():
#     "Run the doc tests (README.txt and docs/*, if any exist)"
#     doctest_files = [
#         os.path.abspath(resource_filename('bs4', 'README.txt'))]
#     if resource_exists('bs4', 'docs'):
#         for name in resource_listdir('bs4', 'docs'):
#             if name.endswith('.txt'):
#                 doctest_files.append(
#                     os.path.abspath(
#                         resource_filename('bs4', 'docs/%s' % name)))
#     kwargs = dict(module_relative=False, optionflags=DOCTEST_FLAGS)
#     atexit.register(cleanup_resources)
#     return unittest.TestSuite((
#         doctest.DocFileSuite(*doctest_files, **kwargs)))
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/test_html5lib.py
"""Tests to ensure that the html5lib tree builder generates good trees."""

import warnings

try:
    from bs4.builder import HTML5TreeBuilder
    HTML5LIB_PRESENT = True
except ImportError as e:
    HTML5LIB_PRESENT = False

from bs4.element import SoupStrainer
from bs4.testing import (
    HTML5TreeBuilderSmokeTest,
    SoupTest,
    skipIf,
)

@skipIf(
    not HTML5LIB_PRESENT,
    "html5lib seems not to be present, not testing its tree builder.")
class HTML5LibBuilderSmokeTest(SoupTest, HTML5TreeBuilderSmokeTest):
    """See ``HTML5TreeBuilderSmokeTest``."""

    @property
    def default_builder(self):
        return HTML5TreeBuilder()

    def test_soupstrainer(self):
        # The html5lib tree builder does not support SoupStrainers.
        strainer = SoupStrainer("b")
        markup = "<p>A <b>bold</b> statement.</p>"
        with warnings.catch_warnings(record=True) as w:
            soup = self.soup(markup, parse_only=strainer)
        self.assertEqual(
            soup.decode(), self.document_for(markup))

        self.assertTrue(
            "the html5lib tree builder doesn't support parse_only" in
            str(w[0].message))

    def test_correctly_nested_tables(self):
        """html5lib inserts <tbody> tags where other parsers don't."""
        markup = ('<table id="1">'
                  '<tr>'
                  "<td>Here's another table:"
                  '<table id="2">'
                  '<tr><td>foo</td></tr>'
                  '</table></td>')

        self.assertSoupEquals(
            markup,
            '<table id="1"><tbody><tr><td>Here\'s another table:'
            '<table id="2"><tbody><tr><td>foo</td></tr></tbody></table>'
            '</td></tr></tbody></table>')

        self.assertSoupEquals(
            "<table><thead><tr><td>Foo</td></tr></thead>"
            "<tbody><tr><td>Bar</td></tr></tbody>"
            "<tfoot><tr><td>Baz</td></tr></tfoot></table>")
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/test_htmlparser.py
"""Tests to ensure that the html.parser tree builder generates good
trees."""

from bs4.testing import SoupTest, HTMLTreeBuilderSmokeTest
from bs4.builder import HTMLParserTreeBuilder

class HTMLParserTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest):

    @property
    def default_builder(self):
        return HTMLParserTreeBuilder()

    def test_namespaced_system_doctype(self):
        # html.parser can't handle namespaced doctypes, so skip this one.
        pass

    def test_namespaced_public_doctype(self):
        # html.parser can't handle namespaced doctypes, so skip this one.
        pass
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/test_lxml.py
"""Tests to ensure that the lxml tree builder generates good trees."""

import re
import warnings

try:
    from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
    LXML_PRESENT = True
except ImportError as e:
    LXML_PRESENT = False

from bs4 import (
    BeautifulSoup,
    BeautifulStoneSoup,
)
from bs4.element import Comment, Doctype, SoupStrainer
from bs4.testing import skipIf
from bs4.tests import test_htmlparser
from bs4.testing import (
    HTMLTreeBuilderSmokeTest,
    XMLTreeBuilderSmokeTest,
    SoupTest,
    skipIf,
)

@skipIf(
    not LXML_PRESENT,
    "lxml seems not to be present, not testing its tree builder.")
class LXMLTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest):
    """See ``HTMLTreeBuilderSmokeTest``."""

    @property
    def default_builder(self):
        return LXMLTreeBuilder()

    def test_out_of_range_entity(self):
        self.assertSoupEquals(
            "<p>foo&#10000000000000;bar</p>", "<p>foobar</p>")
        self.assertSoupEquals(
            "<p>foo&#x10000000000000;bar</p>", "<p>foobar</p>")
        self.assertSoupEquals(
            "<p>foo&#1000000000;bar</p>", "<p>foobar</p>")

    def test_beautifulstonesoup_is_xml_parser(self):
        # Make sure that the deprecated BSS class uses an xml builder
        # if one is installed.
        with warnings.catch_warnings(record=True) as w:
            soup = BeautifulStoneSoup("<b />")
        self.assertEqual("<b/>", str(soup.b))

    def test_real_xhtml_document(self):
        """lxml strips the XML definition from an XHTML doc, which is fine."""
        markup = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello.</title></head>
<body>Goodbye.</body>
</html>"""
        soup = self.soup(markup)
        self.assertEqual(
            soup.encode("utf-8").replace(b"\n", b""),
            markup.replace(b"\n", b"").replace(
                b'<?xml version="1.0" encoding="utf-8"?>', b""))
@skipIf(
    not LXML_PRESENT,
    "lxml seems not to be present, not testing its XML tree builder.")
class LXMLXMLTreeBuilderSmokeTest(SoupTest, XMLTreeBuilderSmokeTest):
    """See ``HTMLTreeBuilderSmokeTest``."""

    @property
    def default_builder(self):
        return LXMLTreeBuilderForXML()
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/test_soup.py
# -*- coding: utf-8 -*-
"""Tests of Beautiful Soup as a whole."""

import unittest
from bs4 import (
    BeautifulSoup,
    BeautifulStoneSoup,
)
from bs4.element import (
    CharsetMetaAttributeValue,
    ContentMetaAttributeValue,
    SoupStrainer,
    NamespacedAttribute,
)
import bs4.dammit
from bs4.dammit import EntitySubstitution, UnicodeDammit
from bs4.testing import (
    SoupTest,
    skipIf,
)
import warnings

try:
    from bs4.builder import LXMLTreeBuilder, LXMLTreeBuilderForXML
    LXML_PRESENT = True
except ImportError as e:
    LXML_PRESENT = False
class TestDeprecatedConstructorArguments(SoupTest):

    def test_parseOnlyThese_renamed_to_parse_only(self):
        with warnings.catch_warnings(record=True) as w:
            soup = self.soup("<a><b></b></a>", parseOnlyThese=SoupStrainer("b"))
        msg = str(w[0].message)
        self.assertTrue("parseOnlyThese" in msg)
        self.assertTrue("parse_only" in msg)
        self.assertEqual(b"<b></b>", soup.encode())

    def test_fromEncoding_renamed_to_from_encoding(self):
        with warnings.catch_warnings(record=True) as w:
            utf8 = b"\xc3\xa9"
            soup = self.soup(utf8, fromEncoding="utf8")
        msg = str(w[0].message)
        self.assertTrue("fromEncoding" in msg)
        self.assertTrue("from_encoding" in msg)
        self.assertEqual("utf8", soup.original_encoding)

    def test_unrecognized_keyword_argument(self):
        self.assertRaises(
            TypeError, self.soup, "", no_such_argument=True)

    @skipIf(
        not LXML_PRESENT,
        "lxml not present, not testing BeautifulStoneSoup.")
    def test_beautifulstonesoup(self):
        with warnings.catch_warnings(record=True) as w:
            soup = BeautifulStoneSoup("")
        self.assertTrue(isinstance(soup, BeautifulSoup))
        self.assertTrue(
            "BeautifulStoneSoup class is deprecated" in str(w[0].message))
class TestSelectiveParsing(SoupTest):

    def test_parse_with_soupstrainer(self):
        markup = "No<b>Yes</b><br>No<b>Yes <c>Yes</c></b>"
        strainer = SoupStrainer("b")
        soup = self.soup(markup, parse_only=strainer)
        self.assertEqual(soup.encode(), b"<b>Yes</b><b>Yes <c>Yes</c></b>")
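# Illustrative sketch, not part of the original tests: the same
# parse_only idea outside the test harness, assuming the stdlib
# html.parser builder.
def _demo_parse_only():
    from bs4 import BeautifulSoup
    from bs4.element import SoupStrainer
    only_b = SoupStrainer("b")
    soup = BeautifulSoup("No<b>Yes</b>No", "html.parser", parse_only=only_b)
    assert soup.encode() == b"<b>Yes</b>"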
class TestEntitySubstitution(unittest.TestCase):
    """Standalone tests of the EntitySubstitution class."""

    def setUp(self):
        self.sub = EntitySubstitution

    def test_simple_html_substitution(self):
        # Unicode characters corresponding to named HTML entities
        # are substituted, and no others.
        s = "foo\u2200\N{SNOWMAN}\u00f5bar"
        self.assertEqual(self.sub.substitute_html(s),
                         "foo&forall;\N{SNOWMAN}&otilde;bar")

    def test_smart_quote_substitution(self):
        # MS smart quotes are a common source of frustration, so we
        # give them a special test.
        quotes = b"\x91\x92foo\x93\x94"
        dammit = UnicodeDammit(quotes)
        self.assertEqual(self.sub.substitute_html(dammit.markup),
                         "&lsquo;&rsquo;foo&ldquo;&rdquo;")

    def test_xml_conversion_includes_no_quotes_if_make_quoted_attribute_is_false(self):
        s = 'Welcome to "my bar"'
        self.assertEqual(self.sub.substitute_xml(s, False), s)

    def test_xml_attribute_quoting_normally_uses_double_quotes(self):
        self.assertEqual(self.sub.substitute_xml("Welcome", True),
                         '"Welcome"')
        self.assertEqual(self.sub.substitute_xml("Bob's Bar", True),
                         '"Bob\'s Bar"')

    def test_xml_attribute_quoting_uses_single_quotes_when_value_contains_double_quotes(self):
        s = 'Welcome to "my bar"'
        self.assertEqual(self.sub.substitute_xml(s, True),
                         "'Welcome to \"my bar\"'")

    def test_xml_attribute_quoting_escapes_single_quotes_when_value_contains_both_single_and_double_quotes(self):
        s = 'Welcome to "Bob\'s Bar"'
        self.assertEqual(
            self.sub.substitute_xml(s, True),
            '"Welcome to &quot;Bob\'s Bar&quot;"')

    def test_xml_quotes_arent_escaped_when_value_is_not_being_quoted(self):
        quoted = 'Welcome to "Bob\'s Bar"'
        self.assertEqual(self.sub.substitute_xml(quoted), quoted)

    def test_xml_quoting_handles_angle_brackets(self):
        self.assertEqual(
            self.sub.substitute_xml("foo<bar>"),
            "foo&lt;bar&gt;")

    def test_xml_quoting_handles_ampersands(self):
        self.assertEqual(self.sub.substitute_xml("AT&T"), "AT&amp;T")

    def test_xml_quoting_ignores_ampersands_when_they_are_part_of_an_entity(self):
        self.assertEqual(
            self.sub.substitute_xml("&Aacute;T&T"),
            "&Aacute;T&amp;T")

    def test_quotes_not_html_substituted(self):
        """There's no need to do this except inside attribute values."""
        text = 'Bob\'s "bar"'
        self.assertEqual(self.sub.substitute_html(text), text)
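# Illustrative sketch, not part of the original tests: the two main
# EntitySubstitution entry points exercised above -- substitute_html for
# named HTML entities, substitute_xml for the XML special characters,
# with an optional second argument that quotes the result as an
# attribute value.
def _demo_entity_substitution():
    from bs4.dammit import EntitySubstitution
    assert EntitySubstitution.substitute_xml("AT&T") == "AT&amp;T"
    assert EntitySubstitution.substitute_xml('say "hi"', True) == "'say \"hi\"'"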
class TestEncodingConversion(SoupTest):
    # Test Beautiful Soup's ability to decode and encode from various
    # encodings.

    def setUp(self):
        super(TestEncodingConversion, self).setUp()
        self.unicode_data = '<html><head></head><body><foo>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</foo></body></html>'
        self.utf8_data = self.unicode_data.encode("utf-8")
        # Just so you know what it looks like.
        self.assertEqual(
            self.utf8_data,
            b'<html><head></head><body><foo>Sacr\xc3\xa9 bleu!</foo></body></html>')

    def test_ascii_in_unicode_out(self):
        # ASCII input is converted to Unicode. The original_encoding
        # attribute is set.
        ascii = b"<foo>a</foo>"
        soup_from_ascii = self.soup(ascii)
        unicode_output = soup_from_ascii.decode()
        self.assertTrue(isinstance(unicode_output, str))
        self.assertEqual(unicode_output, self.document_for(ascii.decode()))
        self.assertEqual(soup_from_ascii.original_encoding, "ascii")

    def test_unicode_in_unicode_out(self):
        # Unicode input is left alone. The original_encoding attribute
        # is not set.
        soup_from_unicode = self.soup(self.unicode_data)
        self.assertEqual(soup_from_unicode.decode(), self.unicode_data)
        self.assertEqual(soup_from_unicode.foo.string, 'Sacr\xe9 bleu!')
        self.assertEqual(soup_from_unicode.original_encoding, None)

    def test_utf8_in_unicode_out(self):
        # UTF-8 input is converted to Unicode. The original_encoding
        # attribute is set.
        soup_from_utf8 = self.soup(self.utf8_data)
        self.assertEqual(soup_from_utf8.decode(), self.unicode_data)
        self.assertEqual(soup_from_utf8.foo.string, 'Sacr\xe9 bleu!')

    def test_utf8_out(self):
        # The internal data structures can be encoded as UTF-8.
        soup_from_unicode = self.soup(self.unicode_data)
        self.assertEqual(soup_from_unicode.encode('utf-8'), self.utf8_data)
class TestUnicodeDammit(unittest.TestCase):
    """Standalone tests of Unicode, Dammit."""

    def test_smart_quotes_to_unicode(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup)
        self.assertEqual(
            dammit.unicode_markup, "<foo>\u2018\u2019\u201c\u201d</foo>")

    def test_smart_quotes_to_xml_entities(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup, smart_quotes_to="xml")
        self.assertEqual(
            dammit.unicode_markup, "<foo>&#x2018;&#x2019;&#x201C;&#x201D;</foo>")

    def test_smart_quotes_to_html_entities(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup, smart_quotes_to="html")
        self.assertEqual(
            dammit.unicode_markup, "<foo>&lsquo;&rsquo;&ldquo;&rdquo;</foo>")

    def test_smart_quotes_to_ascii(self):
        markup = b"<foo>\x91\x92\x93\x94</foo>"
        dammit = UnicodeDammit(markup, smart_quotes_to="ascii")
        self.assertEqual(
            dammit.unicode_markup, """<foo>''""</foo>""")

    def test_detect_utf8(self):
        utf8 = b"\xc3\xa9"
        dammit = UnicodeDammit(utf8)
        self.assertEqual(dammit.unicode_markup, '\xe9')
        self.assertEqual(dammit.original_encoding, 'utf-8')

    def test_convert_hebrew(self):
        hebrew = b"\xed\xe5\xec\xf9"
        dammit = UnicodeDammit(hebrew, ["iso-8859-8"])
        self.assertEqual(dammit.original_encoding, 'iso-8859-8')
        self.assertEqual(dammit.unicode_markup, '\u05dd\u05d5\u05dc\u05e9')

    def test_dont_see_smart_quotes_where_there_are_none(self):
        utf_8 = b"\343\202\261\343\203\274\343\202\277\343\202\244 Watch"
        dammit = UnicodeDammit(utf_8)
        self.assertEqual(dammit.original_encoding, 'utf-8')
        self.assertEqual(dammit.unicode_markup.encode("utf-8"), utf_8)

    def test_ignore_inappropriate_codecs(self):
        utf8_data = "Räksmörgås".encode("utf-8")
        dammit = UnicodeDammit(utf8_data, ["iso-8859-8"])
        self.assertEqual(dammit.original_encoding, 'utf-8')

    def test_ignore_invalid_codecs(self):
        utf8_data = "Räksmörgås".encode("utf-8")
        for bad_encoding in ['.utf8', '...', 'utF---16.!']:
            dammit = UnicodeDammit(utf8_data, [bad_encoding])
            self.assertEqual(dammit.original_encoding, 'utf-8')

    def test_detect_html5_style_meta_tag(self):
        for data in (
            b'<html><meta charset="euc-jp" /></html>',
            b"<html><meta charset='euc-jp' /></html>",
            b"<html><meta charset=euc-jp /></html>",
            b"<html><meta charset=euc-jp/></html>"):
            dammit = UnicodeDammit(data, is_html=True)
            self.assertEqual(
                "euc-jp", dammit.original_encoding)
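    # Illustrative sketch, not part of the original suite: the typical
    # UnicodeDammit call pattern the tests above exercise -- feed it
    # bytes plus optional encoding guesses, then read unicode_markup and
    # original_encoding back. The method name is an editorial invention.
    def demo_unicode_dammit_usage(self):
        utf8_bytes = "Sacr\xe9 bleu!".encode("utf8")
        # The bad first guess is skipped, as test_ignore_inappropriate_codecs
        # shows above.
        dammit = UnicodeDammit(utf8_bytes, ["iso-8859-8", "utf-8"])
        self.assertEqual("utf-8", dammit.original_encoding)
        self.assertEqual("Sacr\xe9 bleu!", dammit.unicode_markup)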
    def test_last_ditch_entity_replacement(self):
        # This is a UTF-8 document that contains bytestrings
        # completely incompatible with UTF-8 (ie. encoded with some other
        # encoding).
        #
        # Since there is no consistent encoding for the document,
        # Unicode, Dammit will eventually encode the document as UTF-8
        # and encode the incompatible characters as REPLACEMENT
        # CHARACTER.
        #
        # If chardet is installed, it will detect that the document
        # can be converted into ISO-8859-1 without errors. This happens
        # to be the wrong encoding, but it is a consistent encoding, so the
        # code we're testing here won't run.
        #
        # So we temporarily disable chardet if it's present.
        doc = b"""\357\273\277<?xml version="1.0" encoding="UTF-8"?>
<html><b>\330\250\330\252\330\261</b>
<i>\310\322\321\220\312\321\355\344</i></html>"""
        chardet = bs4.dammit.chardet
        try:
            bs4.dammit.chardet = None
            with warnings.catch_warnings(record=True) as w:
                dammit = UnicodeDammit(doc)
                self.assertEqual(True, dammit.contains_replacement_characters)
                self.assertTrue("\ufffd" in dammit.unicode_markup)

                soup = BeautifulSoup(doc, "html.parser")
                self.assertTrue(soup.contains_replacement_characters)

                msg = w[0].message
                self.assertTrue(isinstance(msg, UnicodeWarning))
                self.assertTrue("Some characters could not be decoded" in str(msg))
        finally:
            bs4.dammit.chardet = chardet

    def test_sniffed_xml_encoding(self):
        # A document written in UTF-16LE will be converted by a different
        # code path that sniffs the byte order markers.
        data = b'\xff\xfe<\x00a\x00>\x00\xe1\x00\xe9\x00<\x00/\x00a\x00>\x00'
        dammit = UnicodeDammit(data)
        self.assertEqual("<a>áé</a>", dammit.unicode_markup)
        self.assertEqual("utf-16le", dammit.original_encoding)
    def test_detwingle(self):
        # Here's a UTF8 document.
        utf8 = ("\N{SNOWMAN}" * 3).encode("utf8")

        # Here's a Windows-1252 document.
        windows_1252 = (
            "\N{LEFT DOUBLE QUOTATION MARK}Hi, I like Windows!"
            "\N{RIGHT DOUBLE QUOTATION MARK}").encode("windows_1252")

        # Through some unholy alchemy, they've been stuck together.
        doc = utf8 + windows_1252 + utf8

        # The document can't be turned into UTF-8:
        self.assertRaises(UnicodeDecodeError, doc.decode, "utf8")

        # Unicode, Dammit thinks the whole document is Windows-1252,
        # and decodes it into "☃☃☃“Hi, I like Windows!”☃☃☃"

        # But if we run it through detwingle(), it's fixed:
        fixed = UnicodeDammit.detwingle(doc)
        self.assertEqual(
            "☃☃☃“Hi, I like Windows!”☃☃☃", fixed.decode("utf8"))

    def test_detwingle_ignores_multibyte_characters(self):
        # Each of these characters has a UTF-8 representation ending
        # in \x93. \x93 is a smart quote if interpreted as
        # Windows-1252. But our code knows to skip over multibyte
        # UTF-8 characters, so they'll survive the process unscathed.
        for tricky_unicode_char in (
            "\N{LATIN SMALL LIGATURE OE}", # 2-byte char '\xc5\x93'
            "\N{LATIN SUBSCRIPT SMALL LETTER X}", # 3-byte char '\xe2\x82\x93'
            "\xf0\x90\x90\x93", # This is a CJK character, not sure which one.
            ):
            input = tricky_unicode_char.encode("utf8")
            self.assertTrue(input.endswith(b'\x93'))
            output = UnicodeDammit.detwingle(input)
            self.assertEqual(output, input)
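# Illustrative sketch, not part of the original tests: detwingle() in
# isolation -- Windows-1252 bytes embedded in UTF-8 are rewritten so the
# whole byte string decodes cleanly as UTF-8.
def _demo_detwingle():
    from bs4.dammit import UnicodeDammit
    utf8 = "\N{SNOWMAN}".encode("utf8")
    windows_1252 = ("\N{LEFT DOUBLE QUOTATION MARK}Hi"
                    "\N{RIGHT DOUBLE QUOTATION MARK}").encode("windows-1252")
    fixed = UnicodeDammit.detwingle(utf8 + windows_1252 + utf8)
    assert fixed.decode("utf8") == ("\N{SNOWMAN}\N{LEFT DOUBLE QUOTATION MARK}"
                                    "Hi\N{RIGHT DOUBLE QUOTATION MARK}\N{SNOWMAN}")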
class TestNamedspacedAttribute(SoupTest):

    def test_name_may_be_none(self):
        a = NamespacedAttribute("xmlns", None)
        self.assertEqual(a, "xmlns")

    def test_attribute_is_equivalent_to_colon_separated_string(self):
        a = NamespacedAttribute("a", "b")
        self.assertEqual("a:b", a)

    def test_attributes_are_equivalent_if_prefix_and_name_identical(self):
        a = NamespacedAttribute("a", "b", "c")
        b = NamespacedAttribute("a", "b", "c")
        self.assertEqual(a, b)

        # The actual namespace is not considered.
        c = NamespacedAttribute("a", "b", None)
        self.assertEqual(a, c)

        # But name and prefix are important.
        d = NamespacedAttribute("a", "z", "c")
        self.assertNotEqual(a, d)

        e = NamespacedAttribute("z", "b", "c")
        self.assertNotEqual(a, e)


class TestAttributeValueWithCharsetSubstitution(unittest.TestCase):

    def test_charset_meta_attribute_value(self):
        value = CharsetMetaAttributeValue("euc-jp")
        self.assertEqual("euc-jp", value)
        self.assertEqual("euc-jp", value.original_value)
        self.assertEqual("utf8", value.encode("utf8"))

    def test_content_meta_attribute_value(self):
        value = ContentMetaAttributeValue("text/html; charset=euc-jp")
        self.assertEqual("text/html; charset=euc-jp", value)
        self.assertEqual("text/html; charset=euc-jp", value.original_value)
        self.assertEqual("text/html; charset=utf8", value.encode("utf8"))
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/test_tree.py
# -*- coding: utf-8 -*-
"""Tests for Beautiful Soup's tree traversal methods.

The tree traversal methods are the main advantage of using Beautiful
Soup over just using a parser.

Different parsers will build different Beautiful Soup trees given the
same markup, but all Beautiful Soup trees can be traversed with the
methods tested here.
"""

import copy
import pickle
import re
import warnings
from bs4 import BeautifulSoup
from bs4.builder import (
    builder_registry,
    HTMLParserTreeBuilder,
)
from bs4.element import (
    CData,
    Doctype,
    NavigableString,
    SoupStrainer,
    Tag,
)
from bs4.testing import (
    SoupTest,
    skipIf,
)

XML_BUILDER_PRESENT = (builder_registry.lookup("xml") is not None)
LXML_PRESENT = (builder_registry.lookup("lxml") is not None)
class TreeTest(SoupTest):

    def assertSelects(self, tags, should_match):
        """Make sure that the given tags have the correct text.

        This is used in tests that define a bunch of tags, each
        containing a single string, and then select certain strings by
        some mechanism.
        """
        self.assertEqual([tag.string for tag in tags], should_match)

    def assertSelectsIDs(self, tags, should_match):
        """Make sure that the given tags have the correct IDs.

        This is used in tests that define a bunch of tags, each
        containing a single string, and then select certain strings by
        some mechanism.
        """
        self.assertEqual([tag['id'] for tag in tags], should_match)


class TestFind(TreeTest):
    """Basic tests of the find() method.

    find() just calls find_all() with limit=1, so it's not tested all
    that thoroughly here.
    """

    def test_find_tag(self):
        soup = self.soup("<a>1</a><b>2</b><a>3</a><b>4</b>")
        self.assertEqual(soup.find("b").string, "2")

    def test_unicode_text_find(self):
        soup = self.soup('<h1>Räksmörgås</h1>')
        self.assertEqual(soup.find(text='Räksmörgås'), 'Räksmörgås')
class TestFindAll(TreeTest):
    """Basic tests of the find_all() method."""

    def test_find_all_text_nodes(self):
        """You can search the tree for text nodes."""
        soup = self.soup("<html>Foo<b>bar</b>\xbb</html>")

        # Exact match.
        self.assertEqual(soup.find_all(text="bar"), ["bar"])
        # Match any of a number of strings.
        self.assertEqual(
            soup.find_all(text=["Foo", "bar"]), ["Foo", "bar"])
        # Match a regular expression.
        self.assertEqual(soup.find_all(text=re.compile('.*')),
                         ["Foo", "bar", '\xbb'])
        # Match anything.
        self.assertEqual(soup.find_all(text=True),
                         ["Foo", "bar", '\xbb'])

    def test_find_all_limit(self):
        """You can limit the number of items returned by find_all."""
        soup = self.soup("<a>1</a><a>2</a><a>3</a><a>4</a><a>5</a>")
        self.assertSelects(soup.find_all('a', limit=3), ["1", "2", "3"])
        self.assertSelects(soup.find_all('a', limit=1), ["1"])
        self.assertSelects(
            soup.find_all('a', limit=10), ["1", "2", "3", "4", "5"])

        # A limit of 0 means no limit.
        self.assertSelects(
            soup.find_all('a', limit=0), ["1", "2", "3", "4", "5"])

    def test_calling_a_tag_is_calling_findall(self):
        soup = self.soup("<a>1</a><b>2<a id='foo'>3</a></b>")
        self.assertSelects(soup('a', limit=1), ["1"])
        self.assertSelects(soup.b(id="foo"), ["3"])

    def test_find_all_with_self_referential_data_structure_does_not_cause_infinite_recursion(self):
        soup = self.soup("")
        # Create a self-referential list.
        l = []
        l.append(l)

        # Without special code in _normalize_search_value, this would cause infinite
        # recursion.
        self.assertEqual([], soup.find_all(l))
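# Illustrative sketch, not part of the original tests: the find_all
# variants the class above exercises, run against a throwaway document
# with the stdlib builder.
def _demo_find_all_variants():
    import re
    from bs4 import BeautifulSoup
    soup = BeautifulSoup("<a>1</a><b>2</b><a>3</a>", "html.parser")
    assert [t.string for t in soup.find_all("a")] == ["1", "3"]
    assert [t.string for t in soup.find_all(re.compile("^[ab]$"))] == ["1", "2", "3"]
    assert soup.find_all(text=True) == ["1", "2", "3"]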
class TestFindAllBasicNamespaces(TreeTest):

    def test_find_by_namespaced_name(self):
        soup = self.soup('<mathml:msqrt>4</mathml:msqrt><a svg:fill="red">')
        self.assertEqual("4", soup.find("mathml:msqrt").string)
        self.assertEqual("a", soup.find(attrs={"svg:fill": "red"}).name)


class TestFindAllByName(TreeTest):
    """Test ways of finding tags by tag name."""

    def setUp(self):
        super(TreeTest, self).setUp()
        self.tree = self.soup("""<html>
                                 <a>First tag.</a>
                                 <b>Second tag.</b>
                                 <c>Third <a>Nested tag.</a> tag.</c>""")

    def test_find_all_by_tag_name(self):
        # Find all the <a> tags.
        self.assertSelects(
            self.tree.find_all('a'), ['First tag.', 'Nested tag.'])

    def test_find_all_by_name_and_text(self):
        self.assertSelects(
            self.tree.find_all('a', text='First tag.'), ['First tag.'])

        self.assertSelects(
            self.tree.find_all('a', text=True), ['First tag.', 'Nested tag.'])

        self.assertSelects(
            self.tree.find_all('a', text=re.compile("tag")),
            ['First tag.', 'Nested tag.'])

    def test_find_all_on_non_root_element(self):
        # You can call find_all on any node, not just the root.
        self.assertSelects(self.tree.c.find_all('a'), ['Nested tag.'])

    def test_calling_element_invokes_find_all(self):
        self.assertSelects(self.tree('a'), ['First tag.', 'Nested tag.'])

    def test_find_all_by_tag_strainer(self):
        self.assertSelects(
            self.tree.find_all(SoupStrainer('a')),
            ['First tag.', 'Nested tag.'])

    def test_find_all_by_tag_names(self):
        self.assertSelects(
            self.tree.find_all(['a', 'b']),
            ['First tag.', 'Second tag.', 'Nested tag.'])

    def test_find_all_by_tag_dict(self):
        self.assertSelects(
            self.tree.find_all({'a': True, 'b': True}),
            ['First tag.', 'Second tag.', 'Nested tag.'])

    def test_find_all_by_tag_re(self):
        self.assertSelects(
            self.tree.find_all(re.compile('^[ab]$')),
            ['First tag.', 'Second tag.', 'Nested tag.'])

    def test_find_all_with_tags_matching_method(self):
        # You can define an oracle method that determines whether
        # a tag matches the search.
        def id_matches_name(tag):
            return tag.name == tag.get('id')

        tree = self.soup("""<a id="a">Match 1.</a>
                            <a id="1">Does not match.</a>
                            <b id="b">Match 2.</b>""")

        self.assertSelects(
            tree.find_all(id_matches_name), ["Match 1.", "Match 2."])
class TestFindAllByAttribute(TreeTest):

    def test_find_all_by_attribute_name(self):
        # You can pass in keyword arguments to find_all to search by
        # attribute.
        tree = self.soup("""
                         <a id="first">Matching a.</a>
                         <a id="second">
                          Non-matching <b id="first">Matching b.</b>a.
                         </a>""")
        self.assertSelects(tree.find_all(id='first'),
                           ["Matching a.", "Matching b."])

    def test_find_all_by_utf8_attribute_value(self):
        peace = "םולש".encode("utf8")
        data = '<a title="םולש"></a>'.encode("utf8")
        soup = self.soup(data)
        self.assertEqual([soup.a], soup.find_all(title=peace))
        self.assertEqual([soup.a], soup.find_all(title=peace.decode("utf8")))
        self.assertEqual([soup.a], soup.find_all(title=[peace, "something else"]))

    def test_find_all_by_attribute_dict(self):
        # You can pass in a dictionary as the argument 'attrs'. This
        # lets you search for attributes like 'name' (a fixed argument
        # to find_all) and 'class' (a reserved word in Python.)
        tree = self.soup("""
                         <a name="name1" class="class1">Name match.</a>
                         <a name="name2" class="class2">Class match.</a>
                         <a name="name3" class="class3">Non-match.</a>
                         <name1>A tag called 'name1'.</name1>
                         """)

        # This doesn't do what you want.
        self.assertSelects(tree.find_all(name='name1'),
                           ["A tag called 'name1'."])

        # This does what you want.
        self.assertSelects(tree.find_all(attrs={'name': 'name1'}),
                           ["Name match."])

        # Passing class='class2' would cause a syntax error.
        self.assertSelects(tree.find_all(attrs={'class': 'class2'}),
                           ["Class match."])

    def test_find_all_by_class(self):
        # Passing in a string to 'attrs' will search the CSS class.
        tree = self.soup("""
                         <a class="1">Class 1.</a>
                         <a class="2">Class 2.</a>
                         <b class="1">Class 1.</b>
                         <c class="3 4">Class 3 and 4.</c>
                         """)
        self.assertSelects(tree.find_all('a', '1'), ['Class 1.'])
        self.assertSelects(tree.find_all(attrs='1'), ['Class 1.', 'Class 1.'])
        self.assertSelects(tree.find_all('c', '3'), ['Class 3 and 4.'])
        self.assertSelects(tree.find_all('c', '4'), ['Class 3 and 4.'])

    def test_find_by_class_when_multiple_classes_present(self):
        tree = self.soup("<gar class='foo bar'>Found it</gar>")

        attrs = {'class': re.compile("o")}
        f = tree.find_all("gar", attrs=attrs)
        self.assertSelects(f, ["Found it"])

        f = tree.find_all("gar", re.compile("a"))
        self.assertSelects(f, ["Found it"])

        # Since the class is not the string "foo bar", but the two
        # strings "foo" and "bar", this will not find anything.
        attrs = {'class': re.compile("o b")}
        f = tree.find_all("gar", attrs=attrs)
        self.assertSelects(f, [])

    def test_find_all_with_non_dictionary_for_attrs_finds_by_class(self):
        soup = self.soup("<a class='bar'>Found it</a>")

        self.assertSelects(soup.find_all("a", re.compile("ba")), ["Found it"])

        def big_attribute_value(value):
            return len(value) > 3
        self.assertSelects(soup.find_all("a", big_attribute_value), [])

        def small_attribute_value(value):
            return len(value) <= 3
        self.assertSelects(soup.find_all("a", small_attribute_value), ["Found it"])

    def test_find_all_with_string_for_attrs_finds_multiple_classes(self):
        soup = self.soup('<a class="foo bar"></a><a class="foo"></a>')
        a, a2 = soup.find_all("a")
        self.assertEqual([a, a2], soup.find_all("a", "foo"))
        self.assertEqual([a], soup.find_all("a", "bar"))

        # If you specify the attribute as a string that contains a
        # space, only that specific value will be found.
        self.assertEqual([a], soup.find_all("a", "foo bar"))
        self.assertEqual([], soup.find_all("a", "bar foo"))
    def test_find_all_with_missing_attribute(self):
        # You can pass in None as the value of an attribute to find_all.
        # This will match tags that do not have that attribute set.
        tree = self.soup("""<a id="1">ID present.</a>
                            <a>No ID present.</a>
                            <a id="">ID is empty.</a>""")
        self.assertSelects(tree.find_all('a', id=None), ["No ID present."])

    def test_find_all_with_defined_attribute(self):
        # You can pass in None as the value of an attribute to find_all.
        # This will match tags that have that attribute set to any value.
        tree = self.soup("""<a id="1">ID present.</a>
                            <a>No ID present.</a>
                            <a id="">ID is empty.</a>""")
        self.assertSelects(
            tree.find_all(id=True), ["ID present.", "ID is empty."])

    def test_find_all_with_numeric_attribute(self):
        # If you search for a number, it's treated as a string.
        tree = self.soup("""<a id=1>Unquoted attribute.</a>
                            <a id="1">Quoted attribute.</a>""")

        expected = ["Unquoted attribute.", "Quoted attribute."]
        self.assertSelects(tree.find_all(id=1), expected)
        self.assertSelects(tree.find_all(id="1"), expected)

    def test_find_all_with_list_attribute_values(self):
        # You can pass a list of attribute values instead of just one,
        # and you'll get tags that match any of the values.
        tree = self.soup("""<a id="1">1</a>
                            <a id="2">2</a>
                            <a id="3">3</a>
                            <a>No ID.</a>""")
        self.assertSelects(tree.find_all(id=["1", "3", "4"]),
                           ["1", "3"])

    def test_find_all_with_regular_expression_attribute_value(self):
        # You can pass a regular expression as an attribute value, and
        # you'll get tags whose values for that attribute match the
        # regular expression.
        tree = self.soup("""<a id="a">One a.</a>
                            <a id="aa">Two as.</a>
                            <a id="ab">Mixed as and bs.</a>
                            <a id="b">One b.</a>
                            <a>No ID.</a>""")
        self.assertSelects(tree.find_all(id=re.compile("^a+$")),
                           ["One a.", "Two as."])

    def test_find_by_name_and_containing_string(self):
        soup = self.soup("<b>foo</b><b>bar</b><a>foo</a>")
        a = soup.a

        self.assertEqual([a], soup.find_all("a", text="foo"))
        self.assertEqual([], soup.find_all("a", text="bar"))
        self.assertEqual([], soup.find_all("a", text="bar"))

    def test_find_by_name_and_containing_string_when_string_is_buried(self):
        soup = self.soup("<a>foo</a><a><b><c>foo</c></b></a>")
        self.assertEqual(soup.find_all("a"), soup.find_all("a", text="foo"))

    def test_find_by_attribute_and_containing_string(self):
        soup = self.soup('<b id="1">foo</b><a id="2">foo</a>')
        a = soup.a

        self.assertEqual([a], soup.find_all(id=2, text="foo"))
        self.assertEqual([], soup.find_all(id=1, text="bar"))
class TestIndex(TreeTest):
    """Test Tag.index"""

    def test_index(self):
        tree = self.soup("""<div>
                            <a>Identical</a>
                            <b>Not identical</b>
                            <a>Identical</a>

                            <c><d>Identical with child</d></c>
                            <b>Also not identical</b>
                            <c><d>Identical with child</d></c>
                            </div>""")
        div = tree.div
        for i, element in enumerate(div.contents):
            self.assertEqual(i, div.index(element))
        self.assertRaises(ValueError, tree.index, 1)


class TestParentOperations(TreeTest):
    """Test navigation and searching through an element's parents."""

    def setUp(self):
        super(TestParentOperations, self).setUp()
        self.tree = self.soup('''<ul id="empty"></ul>
                                 <ul id="top">
                                  <ul id="middle">
                                   <ul id="bottom"><b>Start here</b></ul>
                                  </ul>''')
        self.start = self.tree.b

    def test_parent(self):
        self.assertEqual(self.start.parent['id'], 'bottom')
        self.assertEqual(self.start.parent.parent['id'], 'middle')
        self.assertEqual(self.start.parent.parent.parent['id'], 'top')

    def test_parent_of_top_tag_is_soup_object(self):
        top_tag = self.tree.contents[0]
        self.assertEqual(top_tag.parent, self.tree)

    def test_soup_object_has_no_parent(self):
        self.assertEqual(None, self.tree.parent)

    def test_find_parents(self):
        self.assertSelectsIDs(
            self.start.find_parents('ul'), ['bottom', 'middle', 'top'])
        self.assertSelectsIDs(
            self.start.find_parents('ul', id="middle"), ['middle'])

    def test_find_parent(self):
        self.assertEqual(self.start.find_parent('ul')['id'], 'bottom')

    def test_parent_of_text_element(self):
        text = self.tree.find(text="Start here")
        self.assertEqual(text.parent.name, 'b')

    def test_text_element_find_parent(self):
        text = self.tree.find(text="Start here")
        self.assertEqual(text.find_parent('ul')['id'], 'bottom')

    def test_parent_generator(self):
        parents = [parent['id'] for parent in self.start.parents
                   if parent is not None and 'id' in parent.attrs]
        self.assertEqual(parents, ['bottom', 'middle', 'top'])
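# Illustrative sketch, not part of the original tests: the parents
# generator walks from an element up to the document root, which is what
# the assertions above rely on.
def _demo_parents_walk():
    from bs4 import BeautifulSoup
    soup = BeautifulSoup('<ul id="outer"><ul id="inner"><b>x</b></ul></ul>',
                         "html.parser")
    ids = [p.get('id') for p in soup.b.parents if p.get('id')]
    assert ids == ['inner', 'outer']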
class ProximityTest(TreeTest):

    def setUp(self):
        super(TreeTest, self).setUp()
        self.tree = self.soup(
            '<html id="start"><head></head><body><b id="1">One</b><b id="2">Two</b><b id="3">Three</b></body></html>')


class TestNextOperations(ProximityTest):

    def setUp(self):
        super(TestNextOperations, self).setUp()
        self.start = self.tree.b

    def test_next(self):
        self.assertEqual(self.start.next_element, "One")
        self.assertEqual(self.start.next_element.next_element['id'], "2")

    def test_next_of_last_item_is_none(self):
        last = self.tree.find(text="Three")
        self.assertEqual(last.next_element, None)

    def test_next_of_root_is_none(self):
        # The document root is outside the next/previous chain.
        self.assertEqual(self.tree.next_element, None)

    def test_find_all_next(self):
        self.assertSelects(self.start.find_all_next('b'), ["Two", "Three"])
        self.start.find_all_next(id=3)
        self.assertSelects(self.start.find_all_next(id=3), ["Three"])

    def test_find_next(self):
        self.assertEqual(self.start.find_next('b')['id'], '2')
        self.assertEqual(self.start.find_next(text="Three"), "Three")

    def test_find_next_for_text_element(self):
        text = self.tree.find(text="One")
        self.assertEqual(text.find_next("b").string, "Two")
        self.assertSelects(text.find_all_next("b"), ["Two", "Three"])

    def test_next_generator(self):
        start = self.tree.find(text="Two")
        successors = [node for node in start.next_elements]
        # There are two successors: the final <b> tag and its text contents.
        tag, contents = successors
        self.assertEqual(tag['id'], '3')
        self.assertEqual(contents, "Three")


class TestPreviousOperations(ProximityTest):

    def setUp(self):
        super(TestPreviousOperations, self).setUp()
        self.end = self.tree.find(text="Three")

    def test_previous(self):
        self.assertEqual(self.end.previous_element['id'], "3")
        self.assertEqual(self.end.previous_element.previous_element, "Two")

    def test_previous_of_first_item_is_none(self):
        first = self.tree.find('html')
        self.assertEqual(first.previous_element, None)

    def test_previous_of_root_is_none(self):
        # The document root is outside the next/previous chain.
        # XXX This is broken!
        #self.assertEqual(self.tree.previous_element, None)
        pass

    def test_find_all_previous(self):
        # The <b> tag containing the "Three" node is the predecessor
        # of the "Three" node itself, which is why "Three" shows up
        # here.
        self.assertSelects(
            self.end.find_all_previous('b'), ["Three", "Two", "One"])
        self.assertSelects(self.end.find_all_previous(id=1), ["One"])

    def test_find_previous(self):
        self.assertEqual(self.end.find_previous('b')['id'], '3')
        self.assertEqual(self.end.find_previous(text="One"), "One")

    def test_find_previous_for_text_element(self):
        text = self.tree.find(text="Three")
        self.assertEqual(text.find_previous("b").string, "Three")
        self.assertSelects(
            text.find_all_previous("b"), ["Three", "Two", "One"])

    def test_previous_generator(self):
        start = self.tree.find(text="One")
        predecessors = [node for node in start.previous_elements]

        # There are four predecessors: the <b> tag containing "One"
        # the <body> tag, the <head> tag, and the <html> tag.
        b, body, head, html = predecessors
        self.assertEqual(b['id'], '1')
        self.assertEqual(body.name, "body")
        self.assertEqual(head.name, "head")
        self.assertEqual(html.name, "html")
class SiblingTest(TreeTest):

    def setUp(self):
        super(SiblingTest, self).setUp()
        markup = '''<html>
                    <span id="1">
                     <span id="1.1"></span>
                    </span>
                    <span id="2">
                     <span id="2.1"></span>
                    </span>
                    <span id="3">
                     <span id="3.1"></span>
                    </span>
                    <span id="4"></span>
                    </html>'''
        # All that whitespace looks good but makes the tests more
        # difficult. Get rid of it.
        markup = re.compile(r"\n\s*").sub("", markup)
        self.tree = self.soup(markup)


class TestNextSibling(SiblingTest):

    def setUp(self):
        super(TestNextSibling, self).setUp()
        self.start = self.tree.find(id="1")

    def test_next_sibling_of_root_is_none(self):
        self.assertEqual(self.tree.next_sibling, None)

    def test_next_sibling(self):
        self.assertEqual(self.start.next_sibling['id'], '2')
        self.assertEqual(self.start.next_sibling.next_sibling['id'], '3')

        # Note the difference between next_sibling and next_element.
        self.assertEqual(self.start.next_element['id'], '1.1')

    def test_next_sibling_may_not_exist(self):
        self.assertEqual(self.tree.html.next_sibling, None)

        nested_span = self.tree.find(id="1.1")
        self.assertEqual(nested_span.next_sibling, None)

        last_span = self.tree.find(id="4")
        self.assertEqual(last_span.next_sibling, None)

    def test_find_next_sibling(self):
        self.assertEqual(self.start.find_next_sibling('span')['id'], '2')

    def test_next_siblings(self):
        self.assertSelectsIDs(self.start.find_next_siblings("span"),
                              ['2', '3', '4'])

        self.assertSelectsIDs(self.start.find_next_siblings(id='3'), ['3'])

    def test_next_sibling_for_text_element(self):
        soup = self.soup("Foo<b>bar</b>baz")
        start = soup.find(text="Foo")
        self.assertEqual(start.next_sibling.name, 'b')
        self.assertEqual(start.next_sibling.next_sibling, 'baz')

        self.assertSelects(start.find_next_siblings('b'), ['bar'])
        self.assertEqual(start.find_next_sibling(text="baz"), "baz")
        self.assertEqual(start.find_next_sibling(text="nonesuch"), None)


class TestPreviousSibling(SiblingTest):

    def setUp(self):
        super(TestPreviousSibling, self).setUp()
        self.end = self.tree.find(id="4")

    def test_previous_sibling_of_root_is_none(self):
        self.assertEqual(self.tree.previous_sibling, None)

    def test_previous_sibling(self):
        self.assertEqual(self.end.previous_sibling['id'], '3')
        self.assertEqual(self.end.previous_sibling.previous_sibling['id'], '2')

        # Note the difference between previous_sibling and previous_element.
        self.assertEqual(self.end.previous_element['id'], '3.1')

    def test_previous_sibling_may_not_exist(self):
        self.assertEqual(self.tree.html.previous_sibling, None)

        nested_span = self.tree.find(id="1.1")
        self.assertEqual(nested_span.previous_sibling, None)

        first_span = self.tree.find(id="1")
        self.assertEqual(first_span.previous_sibling, None)

    def test_find_previous_sibling(self):
        self.assertEqual(self.end.find_previous_sibling('span')['id'], '3')

    def test_previous_siblings(self):
        self.assertSelectsIDs(self.end.find_previous_siblings("span"),
                              ['3', '2', '1'])

        self.assertSelectsIDs(self.end.find_previous_siblings(id='1'), ['1'])

    def test_previous_sibling_for_text_element(self):
        soup = self.soup("Foo<b>bar</b>baz")
        start = soup.find(text="baz")
        self.assertEqual(start.previous_sibling.name, 'b')
        self.assertEqual(start.previous_sibling.previous_sibling, 'Foo')

        self.assertSelects(start.find_previous_siblings('b'), ['bar'])
        self.assertEqual(start.find_previous_sibling(text="Foo"), "Foo")
        self.assertEqual(start.find_previous_sibling(text="nonesuch"), None)
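# Illustrative sketch, not part of the original tests: next_sibling stays
# at the same level while next_element descends into children -- the
# distinction both sibling test classes above point out.
def _demo_sibling_vs_element():
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(
        '<span id="1"><span id="1.1"></span></span><span id="2"></span>',
        "html.parser")
    start = soup.find(id="1")
    assert start.next_sibling['id'] == '2'
    assert start.next_element['id'] == '1.1'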
class TestTagCreation(SoupTest):
    """Test the ability to create new tags."""

    def test_new_tag(self):
        soup = self.soup("")
        new_tag = soup.new_tag("foo", bar="baz")
        self.assertTrue(isinstance(new_tag, Tag))
        self.assertEqual("foo", new_tag.name)
        self.assertEqual(dict(bar="baz"), new_tag.attrs)
        self.assertEqual(None, new_tag.parent)

    def test_tag_inherits_self_closing_rules_from_builder(self):
        if XML_BUILDER_PRESENT:
            xml_soup = BeautifulSoup("", "xml")
            xml_br = xml_soup.new_tag("br")
            xml_p = xml_soup.new_tag("p")

            # Both the <br> and <p> tags are empty-element, just because
            # they have no contents.
            self.assertEqual(b"<br/>", xml_br.encode())
            self.assertEqual(b"<p/>", xml_p.encode())

        html_soup = BeautifulSoup("", "html")
        html_br = html_soup.new_tag("br")
        html_p = html_soup.new_tag("p")

        # The HTML builder uses HTML's rules about which tags are
        # empty-element tags, and the new tags reflect these rules.
        self.assertEqual(b"<br/>", html_br.encode())
        self.assertEqual(b"<p></p>", html_p.encode())

    def test_new_string_creates_navigablestring(self):
        soup = self.soup("")
        s = soup.new_string("foo")
        self.assertEqual("foo", s)
        self.assertTrue(isinstance(s, NavigableString))
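# Illustrative sketch, not part of the original tests: building new nodes
# with new_tag and new_string and grafting them into an existing tree.
def _demo_tag_creation():
    from bs4 import BeautifulSoup
    soup = BeautifulSoup("<b></b>", "html.parser")
    link = soup.new_tag("a", href="http://example.com/")
    link.append(soup.new_string("link"))
    soup.b.append(link)
    assert soup.decode() == '<b><a href="http://example.com/">link</a></b>'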
class TestTreeModification(SoupTest):

    def test_attribute_modification(self):
        soup = self.soup('<a id="1"></a>')
        soup.a['id'] = 2
        self.assertEqual(soup.decode(), self.document_for('<a id="2"></a>'))
        del(soup.a['id'])
        self.assertEqual(soup.decode(), self.document_for('<a></a>'))
        soup.a['id2'] = 'foo'
        self.assertEqual(soup.decode(), self.document_for('<a id2="foo"></a>'))

    def test_new_tag_creation(self):
        builder = builder_registry.lookup('html')()
        soup = self.soup("<body></body>", builder=builder)
        a = Tag(soup, builder, 'a')
        ol = Tag(soup, builder, 'ol')
        a['href'] = 'http://foo.com/'
        soup.body.insert(0, a)
        soup.body.insert(1, ol)
        self.assertEqual(
            soup.body.encode(),
            b'<body><a href="http://foo.com/"></a><ol></ol></body>')

    def test_append_to_contents_moves_tag(self):
        doc = """<p id="1">Don't leave me <b>here</b>.</p>
                <p id="2">Don\'t leave!</p>"""
        soup = self.soup(doc)
        second_para = soup.find(id='2')
        bold = soup.b

        # Move the <b> tag to the end of the second paragraph.
        soup.find(id='2').append(soup.b)

        # The <b> tag is now a child of the second paragraph.
        self.assertEqual(bold.parent, second_para)

        self.assertEqual(
            soup.decode(), self.document_for(
                '<p id="1">Don\'t leave me .</p>\n'
                '<p id="2">Don\'t leave!<b>here</b></p>'))

    def test_replace_with_returns_thing_that_was_replaced(self):
        text = "<a></a><b><c></c></b>"
        soup = self.soup(text)
        a = soup.a
        new_a = a.replace_with(soup.c)
        self.assertEqual(a, new_a)

    def test_unwrap_returns_thing_that_was_replaced(self):
        text = "<a><b></b><c></c></a>"
        soup = self.soup(text)
        a = soup.a
        new_a = a.unwrap()
        self.assertEqual(a, new_a)

    def test_replace_tag_with_itself(self):
        text = "<a><b></b><c>Foo<d></d></c></a><a><e></e></a>"
        soup = self.soup(text)
        c = soup.c
        soup.c.replace_with(c)
        self.assertEqual(soup.decode(), self.document_for(text))
    def test_replace_tag_with_its_parent_raises_exception(self):
        text = "<a><b></b></a>"
        soup = self.soup(text)
        self.assertRaises(ValueError, soup.b.replace_with, soup.a)

    def test_insert_tag_into_itself_raises_exception(self):
        text = "<a><b></b></a>"
        soup = self.soup(text)
        self.assertRaises(ValueError, soup.a.insert, 0, soup.a)

    def test_replace_with_maintains_next_element_throughout(self):
        soup = self.soup('<p><a>one</a><b>three</b></p>')
        a = soup.a
        b = a.contents[0]
        # Make it so the <a> tag has two text children.
        a.insert(1, "two")

        # Now replace each one with the empty string.
        left, right = a.contents
        left.replaceWith('')
        right.replaceWith('')

        # The <b> tag is still connected to the tree.
        self.assertEqual("three", soup.b.string)

    def test_replace_final_node(self):
        soup = self.soup("<b>Argh!</b>")
        soup.find(text="Argh!").replace_with("Hooray!")
        new_text = soup.find(text="Hooray!")
        b = soup.b
        self.assertEqual(new_text.previous_element, b)
        self.assertEqual(new_text.parent, b)
        self.assertEqual(new_text.previous_element.next_element, new_text)
        self.assertEqual(new_text.next_element, None)
    def test_consecutive_text_nodes(self):
        # A builder should never create two consecutive text nodes,
        # but if you insert one next to another, Beautiful Soup will
        # handle it correctly.
        soup = self.soup("<a><b>Argh!</b><c></c></a>")
        soup.b.insert(1, "Hooray!")

        self.assertEqual(
            soup.decode(), self.document_for(
                "<a><b>Argh!Hooray!</b><c></c></a>"))

        new_text = soup.find(text="Hooray!")
        self.assertEqual(new_text.previous_element, "Argh!")
        self.assertEqual(new_text.previous_element.next_element, new_text)

        self.assertEqual(new_text.previous_sibling, "Argh!")
        self.assertEqual(new_text.previous_sibling.next_sibling, new_text)

        self.assertEqual(new_text.next_sibling, None)
        self.assertEqual(new_text.next_element, soup.c)

    def test_insert_string(self):
        soup = self.soup("<a></a>")
        soup.a.insert(0, "bar")
        soup.a.insert(0, "foo")
        # The strings were added to the tag.
        self.assertEqual(["foo", "bar"], soup.a.contents)
        # And they were converted to NavigableStrings.
        self.assertEqual(soup.a.contents[0].next_element, "bar")

    def test_insert_tag(self):
        builder = self.default_builder
        soup = self.soup(
            "<a><b>Find</b><c>lady!</c></a>", builder=builder)
        magic_tag = Tag(soup, builder, 'magictag')
        magic_tag.insert(0, "the")
        soup.a.insert(1, magic_tag)

        self.assertEqual(
            soup.decode(), self.document_for(
                "<a><b>Find</b><magictag>the</magictag><c>lady!</c></a>"))

        # Make sure all the relationships are hooked up correctly.
        b_tag = soup.b
        self.assertEqual(b_tag.next_sibling, magic_tag)
        self.assertEqual(magic_tag.previous_sibling, b_tag)

        find = b_tag.find(text="Find")
        self.assertEqual(find.next_element, magic_tag)
        self.assertEqual(magic_tag.previous_element, find)

        c_tag = soup.c
        self.assertEqual(magic_tag.next_sibling, c_tag)
        self.assertEqual(c_tag.previous_sibling, magic_tag)

        the = magic_tag.find(text="the")
        self.assertEqual(the.parent, magic_tag)
        self.assertEqual(the.next_element, c_tag)
        self.assertEqual(c_tag.previous_element, the)
    def test_append_child_thats_already_at_the_end(self):
        data = "<a><b></b></a>"
        soup = self.soup(data)
        soup.a.append(soup.b)
        self.assertEqual(data, soup.decode())

    def test_move_tag_to_beginning_of_parent(self):
        data = "<a><b></b><c></c><d></d></a>"
        soup = self.soup(data)
        soup.a.insert(0, soup.d)
        self.assertEqual("<a><d></d><b></b><c></c></a>", soup.decode())

    def test_insert_works_on_empty_element_tag(self):
        # This is a little strange, since most HTML parsers don't allow
        # markup like this to come through. But in general, we don't
        # know what the parser would or wouldn't have allowed, so
        # I'm letting this succeed for now.
        soup = self.soup("<br/>")
        soup.br.insert(1, "Contents")
        self.assertEqual(str(soup.br), "<br>Contents</br>")

    def test_insert_before(self):
        soup = self.soup("<a>foo</a><b>bar</b>")
        soup.b.insert_before("BAZ")
        soup.a.insert_before("QUUX")
        self.assertEqual(
            soup.decode(), self.document_for("QUUX<a>foo</a>BAZ<b>bar</b>"))

        soup.a.insert_before(soup.b)
        self.assertEqual(
            soup.decode(), self.document_for("QUUX<b>bar</b><a>foo</a>BAZ"))

    def test_insert_after(self):
        soup = self.soup("<a>foo</a><b>bar</b>")
        soup.b.insert_after("BAZ")
        soup.a.insert_after("QUUX")
        self.assertEqual(
            soup.decode(), self.document_for("<a>foo</a>QUUX<b>bar</b>BAZ"))

        soup.b.insert_after(soup.a)
        self.assertEqual(
            soup.decode(), self.document_for("QUUX<b>bar</b><a>foo</a>BAZ"))

    def test_insert_after_raises_valueerror_if_after_has_no_meaning(self):
        soup = self.soup("")
        tag = soup.new_tag("a")
        string = soup.new_string("")
        self.assertRaises(ValueError, string.insert_after, tag)
        self.assertRaises(ValueError, soup.insert_after, tag)
        self.assertRaises(ValueError, tag.insert_after, tag)

    def test_insert_before_raises_valueerror_if_before_has_no_meaning(self):
        soup = self.soup("")
        tag = soup.new_tag("a")
        string = soup.new_string("")
        self.assertRaises(ValueError, string.insert_before, tag)
        self.assertRaises(ValueError, soup.insert_before, tag)
        self.assertRaises(ValueError, tag.insert_before, tag)
def test_replace_with(self):
soup = self.soup(
"<p>There's <b>no</b> business like <b>show</b> business</p>")
no, show = soup.find_all('b')
show.replace_with(no)
self.assertEqual(
soup.decode(),
self.document_for(
"<p>There's  business like <b>no</b> business</p>"))
self.assertEqual(show.parent, None)
self.assertEqual(no.parent, soup.p)
self.assertEqual(no.next_element, "no")
self.assertEqual(no.next_sibling, " business")
def test_replace_first_child(self):
data = "<a><b></b><c></c></a>"
soup = self.soup(data)
soup.b.replace_with(soup.c)
self.assertEqual("<a><c></c></a>", soup.decode())
def test_replace_last_child(self):
data = "<a><b></b><c></c></a>"
soup = self.soup(data)
soup.c.replace_with(soup.b)
self.assertEqual("<a><b></b></a>", soup.decode())
def test_nested_tag_replace_with(self):
soup = self.soup(
"""<a>We<b>reserve<c>the</c><d>right</d></b></a><e>to<f>refuse</f><g>service</g></e>""")
# Replace the entire <b> tag and its contents ("reserve the
# right") with the <f> tag ("refuse").
remove_tag = soup.b
move_tag = soup.f
remove_tag.replace_with(move_tag)
self.assertEqual(
soup.decode(), self.document_for(
"<a>We<f>refuse</f></a><e>to<g>service</g></e>"))
# The <b> tag is now an orphan.
self.assertEqual(remove_tag.parent, None)
self.assertEqual(remove_tag.find(text="right").next_element, None)
self.assertEqual(remove_tag.previous_element, None)
self.assertEqual(remove_tag.next_sibling, None)
self.assertEqual(remove_tag.previous_sibling, None)
# The <f> tag is now connected to the <a> tag.
self.assertEqual(move_tag.parent, soup.a)
self.assertEqual(move_tag.previous_element, "We")
self.assertEqual(move_tag.next_element.next_element, soup.e)
self.assertEqual(move_tag.next_sibling, None)
# The gap where the <b> tag used to be has been mended, and
# the word "to" is now connected to the <g> tag.
to_text = soup.find(text="to")
g_tag = soup.g
self.assertEqual(to_text.next_element, g_tag)
self.assertEqual(to_text.next_sibling, g_tag)
self.assertEqual(g_tag.previous_element, to_text)
self.assertEqual(g_tag.previous_sibling, to_text)
def test_unwrap(self):
tree = self.soup("""
<p>Unneeded <em>formatting</em> is unneeded</p>
""")
tree.em.unwrap()
self.assertEqual(tree.em, None)
self.assertEqual(tree.p.text, "Unneeded formatting is unneeded")
def test_wrap(self):
soup = self.soup("I wish I was bold.")
value = soup.string.wrap(soup.new_tag("b"))
self.assertEqual(value.decode(), "<b>I wish I was bold.</b>")
self.assertEqual(
soup.decode(), self.document_for("<b>I wish I was bold.</b>"))
def test_wrap_extracts_tag_from_elsewhere(self):
soup = self.soup("<b></b>I wish I was bold.")
soup.b.next_sibling.wrap(soup.b)
self.assertEqual(
soup.decode(), self.document_for("<b>I wish I was bold.</b>"))
def test_wrap_puts_new_contents_at_the_end(self):
soup = self.soup("<b>I like being bold.</b>I wish I was bold.")
soup.b.next_sibling.wrap(soup.b)
self.assertEqual(2, len(soup.b.contents))
self.assertEqual(
soup.decode(), self.document_for(
"<b>I like being bold.I wish I was bold.</b>"))
def test_extract(self):
soup = self.soup(
'<html><body>Some content. <div id="nav">Nav crap</div> More content.</body></html>')
self.assertEqual(len(soup.body.contents), 3)
extracted = soup.find(id="nav").extract()
self.assertEqual(
soup.decode(), "<html><body>Some content.  More content.</body></html>")
self.assertEqual(extracted.decode(), '<div id="nav">Nav crap</div>')
# The extracted tag is now an orphan.
self.assertEqual(len(soup.body.contents), 2)
self.assertEqual(extracted.parent, None)
self.assertEqual(extracted.previous_element, None)
self.assertEqual(extracted.next_element.next_element, None)
# The gap where the extracted tag used to be has been mended.
content_1 = soup.find(text="Some content. ")
content_2 = soup.find(text=" More content.")
self.assertEqual(content_1.next_element, content_2)
self.assertEqual(content_1.next_sibling, content_2)
self.assertEqual(content_2.previous_element, content_1)
self.assertEqual(content_2.previous_sibling, content_1)
def test_extract_distinguishes_between_identical_strings(self):
soup = self.soup("<a>foo</a><b>bar</b>")
foo_1 = soup.a.string
bar_1 = soup.b.string
foo_2 = soup.new_string("foo")
bar_2 = soup.new_string("bar")
soup.a.append(foo_2)
soup.b.append(bar_2)
# Now there are two identical strings in the <a> tag, and two
# in the <b> tag. Let's remove the first "foo" and the second
# "bar".
foo_1.extract()
bar_2.extract()
self.assertEqual(foo_2, soup.a.string)
self.assertEqual(bar_2, soup.b.string)
def test_clear(self):
"""Tag.clear()"""
soup = self.soup("<p><a>String <em>Italicized</em></a> and another</p>")
# clear using extract()
a = soup.a
soup.p.clear()
self.assertEqual(len(soup.p.contents), 0)
self.assertTrue(hasattr(a, "contents"))
# clear using decompose()
em = a.em
a.clear(decompose=True)
self.assertFalse(hasattr(em, "contents"))
def test_string_set(self):
"""Tag.string = 'string'"""
soup = self.soup("<a></a> <b></b>")
soup.a.string = "foo"
self.assertEqual(soup.a.contents, ["foo"])
soup.b.string = "bar"
self.assertEqual(soup.b.contents, ["bar"])
def test_string_set_does_not_affect_original_string(self):
soup = self.soup("<a><b>foo</b><c>bar</c></a>")
soup.b.string = soup.c.string
self.assertEqual(soup.a.encode(), b"<a><b>bar</b><c>bar</c></a>")
def test_set_string_preserves_class_of_string(self):
soup = self.soup("<a></a>")
cdata = CData("foo")
soup.a.string = cdata
self.assertTrue(isinstance(soup.a.string, CData))
class TestElementObjects(SoupTest):
"""Test various features of element objects."""
def test_len(self):
"""The length of an element is its number of children."""
soup = self.soup("<top>1<b>2</b>3</top>")
# The BeautifulSoup object itself contains one element: the
# <top> tag.
self.assertEqual(len(soup.contents), 1)
self.assertEqual(len(soup), 1)
# The <top> tag contains three elements: the text node "1", the
# <b> tag, and the text node "3".
self.assertEqual(len(soup.top), 3)
self.assertEqual(len(soup.top.contents), 3)
def test_member_access_invokes_find(self):
"""Accessing a Python member .foo invokes find('foo')"""
soup = self.soup('<b><i></i></b>')
self.assertEqual(soup.b, soup.find('b'))
self.assertEqual(soup.b.i, soup.find('b').find('i'))
self.assertEqual(soup.a, None)
def test_deprecated_member_access(self):
soup = self.soup('<b><i></i></b>')
with warnings.catch_warnings(record=True) as w:
tag = soup.bTag
self.assertEqual(soup.b, tag)
self.assertEqual(
'.bTag is deprecated, use .find("b") instead.',
str(w[0].message))
def test_has_attr(self):
"""has_attr() checks for the presence of an attribute.
Please note: has_attr() is different from
__in__. has_attr() checks the tag's attributes and __in__
checks the tag's children.
"""
soup = self.soup("<foo attr='bar'>")
self.assertTrue(soup.foo.has_attr('attr'))
self.assertFalse(soup.foo.has_attr('attr2'))
def test_attributes_come_out_in_alphabetical_order(self):
markup = '<b a="1" z="5" m="3" f="2" y="4"></b>'
self.assertSoupEquals(markup, '<b a="1" f="2" m="3" y="4" z="5"></b>')
def test_string(self):
# A tag that contains only a text node makes that node
# available as .string.
soup = self.soup("<b>foo</b>")
self.assertEqual(soup.b.string, 'foo')
def test_empty_tag_has_no_string(self):
# A tag with no children has no .string.
soup = self.soup("<b></b>")
self.assertEqual(soup.b.string, None)
def test_tag_with_multiple_children_has_no_string(self):
# A tag with multiple children has no .string.
soup = self.soup("<b>foo<i></i></b>")
self.assertEqual(soup.b.string, None)
soup = self.soup("<b>foo<i></i>bar</b>")
self.assertEqual(soup.b.string, None)
# Even if all the children are strings, due to trickery,
# it won't work--but this would be a good optimization.
soup = self.soup("<a>foo</a>")
soup.a.insert(1, "bar")
self.assertEqual(soup.a.string, None)
def test_tag_with_recursive_string_has_string(self):
# A tag with a single child which has a .string inherits that
# .string.
soup = self.soup("<a><b>foo</b></a>")
self.assertEqual(soup.a.string, "foo")
self.assertEqual(soup.string, "foo")
def test_lack_of_string(self):
"""Only a tag containing a single text node has a .string."""
soup = self.soup("<b>f<i>e</i>o</b>")
self.assertFalse(soup.b.string)
soup = self.soup("<b></b>")
self.assertFalse(soup.b.string)
def test_all_text(self):
"""Tag.text and Tag.get_text(sep=u"") -> all child text, concatenated"""
soup = self.soup("<a>a<b>r</b>   <r> t </r></a>")
self.assertEqual(soup.a.text, "ar  t ")
self.assertEqual(soup.a.get_text(strip=True), "art")
self.assertEqual(soup.a.get_text(","), "a,r, , t ")
self.assertEqual(soup.a.get_text(",", strip=True), "a,r,t")
class TestCDAtaListAttributes(SoupTest):
"""Testing cdata-list attributes like 'class'.
"""
def test_single_value_becomes_list(self):
soup = self.soup("<a class='foo'>")
self.assertEqual(["foo"], soup.a['class'])
def test_multiple_values_becomes_list(self):
soup = self.soup("<a class='foo bar'>")
self.assertEqual(["foo", "bar"], soup.a['class'])
def test_multiple_values_separated_by_weird_whitespace(self):
soup = self.soup("<a class=' foo\tbar\nbaz '>")
self.assertEqual(["foo", "bar", "baz"], soup.a['class'])
def test_attributes_joined_into_string_on_output(self):
soup = self.soup("<a class='foo\tbar'>")
self.assertEqual(b'<a class="foo bar"></a>', soup.a.encode())
def test_accept_charset(self):
soup = self.soup('<form accept-charset="ISO-8859-1 UTF-8">')
self.assertEqual(['ISO-8859-1', 'UTF-8'], soup.form['accept-charset'])
def test_cdata_attribute_applying_only_to_one_tag(self):
data = '<a accept-charset="ISO-8859-1 UTF-8"></a>'
soup = self.soup(data)
# We saw in another test that accept-charset is a cdata-list
# attribute for the <form> tag. But it's not a cdata-list
# attribute for any other tag.
self.assertEqual('ISO-8859-1 UTF-8', soup.a['accept-charset'])
class TestPersistence(SoupTest):
"Testing features like pickle and deepcopy."
def setUp(self):
super(TestPersistence, self).setUp()
self.page = """
[removed]
foo
bar
"""
self.tree = self.soup(self.page)
def test_pickle_and_unpickle_identity(self):
# Pickling a tree, then unpickling it, yields a tree identical
# to the original.
dumped = pickle.dumps(self.tree, 2)
loaded = pickle.loads(dumped)
self.assertEqual(loaded.__class__, BeautifulSoup)
self.assertEqual(loaded.decode(), self.tree.decode())
def test_deepcopy_identity(self):
# Making a deepcopy of a tree yields an identical tree.
copied = copy.deepcopy(self.tree)
self.assertEqual(copied.decode(), self.tree.decode())
def test_unicode_pickle(self):
# A tree containing Unicode characters can be pickled.
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
dumped = pickle.dumps(soup, pickle.HIGHEST_PROTOCOL)
loaded = pickle.loads(dumped)
self.assertEqual(loaded.decode(), soup.decode())
class TestSubstitutions(SoupTest):
def test_default_formatter_is_minimal(self):
markup = "<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
soup = self.soup(markup)
decoded = soup.decode(formatter="minimal")
# The < is converted back into &lt; but the e-with-acute is left alone.
self.assertEqual(
decoded,
self.document_for(
"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"))
def test_formatter_html(self):
markup = "<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
soup = self.soup(markup)
decoded = soup.decode(formatter="html")
self.assertEqual(
decoded,
self.document_for("<b>&lt;&lt;Sacr&eacute; bleu!&gt;&gt;</b>"))
def test_formatter_minimal(self):
markup = "<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
soup = self.soup(markup)
decoded = soup.decode(formatter="minimal")
# The < is converted back into &lt; but the e-with-acute is left alone.
self.assertEqual(
decoded,
self.document_for(
"<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"))
def test_formatter_null(self):
markup = "<b>&lt;&lt;Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!&gt;&gt;</b>"
soup = self.soup(markup)
decoded = soup.decode(formatter=None)
# Neither the angle brackets nor the e-with-acute are converted.
# This is not valid HTML, but it's what the user wanted.
self.assertEqual(decoded,
self.document_for("<b><<Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!>></b>"))
def test_formatter_custom(self):
markup = "<b>&lt;foo&gt;</b><b>bar</b>"
soup = self.soup(markup)
decoded = soup.decode(formatter = lambda x: x.upper())
# Instead of normal entity conversion code, the custom
# callable is called on every string.
self.assertEqual(
decoded,
self.document_for("<b><FOO></b><b>BAR</b>"))
def test_formatter_is_run_on_attribute_values(self):
markup = '<a href="http://a.com?a=b&c=é">e</a>'
soup = self.soup(markup)
a = soup.a
expect_minimal = '<a href="http://a.com?a=b&amp;c=é">e</a>'
self.assertEqual(expect_minimal, a.decode())
self.assertEqual(expect_minimal, a.decode(formatter="minimal"))
expect_html = '<a href="http://a.com?a=b&amp;c=&eacute;">e</a>'
self.assertEqual(expect_html, a.decode(formatter="html"))
self.assertEqual(markup, a.decode(formatter=None))
expect_upper = '<a href="HTTP://A.COM?A=B&C=É">E</a>'
self.assertEqual(expect_upper, a.decode(formatter=lambda x: x.upper()))
def test_prettify_accepts_formatter(self):
soup = BeautifulSoup("<html><body>foo</body></html>")
pretty = soup.prettify(formatter = lambda x: x.upper())
self.assertTrue("FOO" in pretty)
def test_prettify_outputs_unicode_by_default(self):
soup = self.soup("<a></a>")
self.assertEqual(str, type(soup.prettify()))
def test_prettify_can_encode_data(self):
soup = self.soup("<a></a>")
self.assertEqual(bytes, type(soup.prettify("utf-8")))
def test_html_entity_substitution_off_by_default(self):
markup = "<b>Sacr\N{LATIN SMALL LETTER E WITH ACUTE} bleu!</b>"
soup = self.soup(markup)
encoded = soup.b.encode("utf-8")
self.assertEqual(encoded, markup.encode('utf-8'))
def test_encoding_substitution(self):
# Here's the <meta> tag saying that a document is
# encoded in Shift-JIS.
meta_tag = ('<meta content="text/html; charset=x-sjis" '
'http-equiv="Content-type"/>')
soup = self.soup(meta_tag)
# Parse the document, and the charset appears unchanged.
self.assertEqual(soup.meta['content'], 'text/html; charset=x-sjis')
# Encode the document into some encoding, and the encoding is
# substituted into the meta tag.
utf_8 = soup.encode("utf-8")
self.assertTrue(b"charset=utf-8" in utf_8)
euc_jp = soup.encode("euc_jp")
self.assertTrue(b"charset=euc_jp" in euc_jp)
shift_jis = soup.encode("shift-jis")
self.assertTrue(b"charset=shift-jis" in shift_jis)
utf_16_u = soup.encode("utf-16").decode("utf-16")
self.assertTrue("charset=utf-16" in utf_16_u)
def test_encoding_substitution_doesnt_happen_if_tag_is_strained(self):
markup = ('<head><meta content="text/html; charset=x-sjis" '
'http-equiv="Content-type"/></head><pre>foo</pre>')
# Beautiful Soup used to try to rewrite the meta tag even if the
# meta tag got filtered out by the strainer. This test makes
# sure that doesn't happen.
strainer = SoupStrainer('pre')
soup = self.soup(markup, parse_only=strainer)
self.assertEqual(soup.contents[0].name, ‘pre’)
class TestEncoding(SoupTest):
"""Test the ability to encode objects into strings."""
def test_unicode_string_can_be_encoded(self):
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
self.assertEqual(soup.b.string.encode("utf-8"),
"\N{SNOWMAN}".encode("utf-8"))
def test_tag_containing_unicode_string_can_be_encoded(self):
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
self.assertEqual(
soup.b.encode("utf-8"), html.encode("utf-8"))
def test_encoding_substitutes_unrecognized_characters_by_default(self):
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
self.assertEqual(soup.b.encode("ascii"), b"<b>&#9731;</b>")
def test_encoding_can_be_made_strict(self):
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
self.assertRaises(
UnicodeEncodeError, soup.encode, "ascii", errors="strict")
def test_decode_contents(self):
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
self.assertEqual("\N{SNOWMAN}", soup.b.decode_contents())
def test_encode_contents(self):
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
self.assertEqual(
"\N{SNOWMAN}".encode("utf8"), soup.b.encode_contents(
encoding="utf8"))
def test_deprecated_renderContents(self):
html = "<b>\N{SNOWMAN}</b>"
soup = self.soup(html)
self.assertEqual(
"\N{SNOWMAN}".encode("utf8"), soup.b.renderContents())
class TestNavigableStringSubclasses(SoupTest):
def test_cdata(self):
# None of the current builders turn CDATA sections into CData
# objects, but you can create them manually.
soup = self.soup("")
cdata = CData("foo")
soup.insert(1, cdata)
self.assertEqual(str(soup), "<![CDATA[foo]]>")
self.assertEqual(soup.find(text="foo"), "foo")
self.assertEqual(soup.contents[0], "foo")
def test_cdata_is_never_formatted(self):
"""Text inside a CData object is passed into the formatter.
But the return value is ignored.
"""
self.count = 0
def increment(*args):
self.count += 1
return "BITTER FAILURE"
soup = self.soup("")
cdata = CData("<><><>")
soup.insert(1, cdata)
self.assertEqual(
b"<![CDATA[<><><>]]>", soup.encode(formatter=increment))
self.assertEqual(1, self.count)
def test_doctype_ends_in_newline(self):
# Unlike other NavigableString subclasses, a DOCTYPE always ends
# in a newline.
doctype = Doctype("foo")
soup = self.soup("")
soup.insert(1, doctype)
self.assertEqual(soup.encode(), b"<!DOCTYPE foo>\n")
class TestSoupSelector(TreeTest):
HTML = """
[removed]
An H1
Some text
Some more text
An H2
Another
Bob
Another H2
me
span1a1
span1a2 test
span2a1
English
English UK
English US
French
"""
def setUp(self):
self.soup = BeautifulSoup(self.HTML)
def assertSelects(self, selector, expected_ids):
el_ids = [el['id'] for el in self.soup.select(selector)]
el_ids.sort()
expected_ids.sort()
self.assertEqual(expected_ids, el_ids,
"Selector %s, expected [%s], got [%s]" % (
selector, ', '.join(expected_ids), ', '.join(el_ids)
)
)
assertSelect = assertSelects
def assertSelectMultiple(self, *tests):
for selector, expected_ids in tests:
self.assertSelect(selector, expected_ids)
def test_one_tag_one(self):
els = self.soup.select('title')
self.assertEqual(len(els), 1)
self.assertEqual(els[0].name, 'title')
self.assertEqual(els[0].contents, ['The title'])
def test_one_tag_many(self):
els = self.soup.select('div')
self.assertEqual(len(els), 3)
for div in els:
self.assertEqual(div.name, 'div')
def test_tag_in_tag_one(self):
els = self.soup.select('div div')
self.assertSelects('div div', ['inner'])
def test_tag_in_tag_many(self):
for selector in ('html div', 'html body div', 'body div'):
self.assertSelects(selector, ['main', 'inner', 'footer'])
def test_tag_no_match(self):
self.assertEqual(len(self.soup.select('del')), 0)
def test_invalid_tag(self):
self.assertEqual(len(self.soup.select('tag%t')), 0)
def test_header_tags(self):
self.assertSelectMultiple(
('h1', ['header1']),
('h2', ['header2', 'header3']),
)
def test_class_one(self):
for selector in ('.onep', 'p.onep', 'html p.onep'):
els = self.soup.select(selector)
self.assertEqual(len(els), 1)
self.assertEqual(els[0].name, 'p')
self.assertEqual(els[0]['class'], ['onep'])
def test_class_mismatched_tag(self):
els = self.soup.select('div.onep')
self.assertEqual(len(els), 0)
def test_one_id(self):
for selector in ('div#inner', '#inner', 'div div#inner'):
self.assertSelects(selector, ['inner'])
def test_bad_id(self):
els = self.soup.select('#doesnotexist')
self.assertEqual(len(els), 0)
def test_items_in_id(self):
els = self.soup.select('div#inner p')
self.assertEqual(len(els), 3)
for el in els:
self.assertEqual(el.name, 'p')
self.assertEqual(els[1]['class'], ['onep'])
self.assertFalse('class' in els[0])
def test_a_bunch_of_emptys(self):
for selector in ('div#main del', 'div#main div.oops', 'div div#main'):
self.assertEqual(len(self.soup.select(selector)), 0)
def test_multi_class_support(self):
for selector in ('.class1', 'p.class1', '.class2', 'p.class2',
'.class3', 'p.class3', 'html p.class2', 'div#inner .class2'):
self.assertSelects(selector, ['pmulti'])
def test_multi_class_selection(self):
for selector in ('.class1.class3', '.class3.class2',
'.class1.class2.class3'):
self.assertSelects(selector, ['pmulti'])
def test_child_selector(self):
self.assertSelects('.s1 > a', ['s1a1', 's1a2'])
self.assertSelects('.s1 > a span', ['s1a2s1'])
def test_attribute_equals(self):
self.assertSelectMultiple(
('p[class="onep"]', ['p1']),
('p[id="p1"]', ['p1']),
('[class="onep"]', ['p1']),
('[id="p1"]', ['p1']),
('link[rel="stylesheet"]', ['l1']),
('link[type="text/css"]', ['l1']),
('link[href="blah.css"]', ['l1']),
('link[href="no-blah.css"]', []),
('[rel="stylesheet"]', ['l1']),
('[type="text/css"]', ['l1']),
('[href="blah.css"]', ['l1']),
('[href="no-blah.css"]', []),
('p[href="no-blah.css"]', []),
('[href="no-blah.css"]', []),
)
def test_attribute_tilde(self):
self.assertSelectMultiple(
('p[class~="class1"]', ['pmulti']),
('p[class~="class2"]', ['pmulti']),
('p[class~="class3"]', ['pmulti']),
('[class~="class1"]', ['pmulti']),
('[class~="class2"]', ['pmulti']),
('[class~="class3"]', ['pmulti']),
('a[rel~="friend"]', ['bob']),
('a[rel~="met"]', ['bob']),
('[rel~="friend"]', ['bob']),
('[rel~="met"]', ['bob']),
)
def test_attribute_startswith(self):
self.assertSelectMultiple(
('[rel^="style"]', ['l1']),
('link[rel^="style"]', ['l1']),
('notlink[rel^="notstyle"]', []),
('[rel^="notstyle"]', []),
('link[rel^="notstyle"]', []),
('link[href^="bla"]', ['l1']),
('a[href^="http://"]', ['bob', 'me']),
('[href^="http://"]', ['bob', 'me']),
('[id^="p"]', ['pmulti', 'p1']),
('[id^="m"]', ['me', 'main']),
('div[id^="m"]', ['main']),
('a[id^="m"]', ['me']),
)
def test_attribute_endswith(self):
self.assertSelectMultiple(
('[href$=".css"]', ['l1']),
('link[href$=".css"]', ['l1']),
('link[id$="1"]', ['l1']),
('[id$="1"]', ['l1', 'p1', 'header1', 's1a1', 's2a1', 's1a2s1']),
('div[id$="1"]', []),
('[id$="noending"]', []),
)
def test_attribute_contains(self):
self.assertSelectMultiple(
# From test_attribute_startswith
('[rel*="style"]', ['l1']),
('link[rel*="style"]', ['l1']),
('notlink[rel*="notstyle"]', []),
('[rel*="notstyle"]', []),
('link[rel*="notstyle"]', []),
('link[href*="bla"]', ['l1']),
('a[href*="http://"]', ['bob', 'me']),
('[href*="http://"]', ['bob', 'me']),
('[id*="p"]', ['pmulti', 'p1']),
('div[id*="m"]', ['main']),
('a[id*="m"]', ['me']),
# From test_attribute_endswith
('[href*=".css"]', ['l1']),
('link[href*=".css"]', ['l1']),
('link[id*="1"]', ['l1']),
('[id*="1"]', ['l1', 'p1', 'header1', 's1a1', 's1a2', 's2a1', 's1a2s1']),
('div[id*="1"]', []),
('[id*="noending"]', []),
# New for this test
('[href*="."]', ['bob', 'me', 'l1']),
('a[href*="."]', ['bob', 'me']),
('link[href*="."]', ['l1']),
('div[id*="n"]', ['main', 'inner']),
('div[id*="nn"]', ['inner']),
)
def test_attribute_exact_or_hypen(self):
self.assertSelectMultiple(
('p[lang|="en"]', ['lang-en', 'lang-en-gb', 'lang-en-us']),
('[lang|="en"]', ['lang-en', 'lang-en-gb', 'lang-en-us']),
('p[lang|="fr"]', ['lang-fr']),
('p[lang|="gb"]', []),
)
def test_attribute_exists(self):
self.assertSelectMultiple(
('[rel]', ['l1', 'bob', 'me']),
('link[rel]', ['l1']),
('a[rel]', ['bob', 'me']),
('[lang]', ['lang-en', 'lang-en-gb', 'lang-en-us', 'lang-fr']),
('p[class]', ['p1', 'pmulti']),
('[blah]', []),
('p[blah]', []),
)
def test_select_on_element(self):
# Other tests operate on the tree; this operates on an element
# within the tree.
inner = self.soup.find("div", id="main")
selected = inner.select("div")
# The <div id="inner"> tag was selected. The <div id="main">
# tag was not.
self.assertSelectsIDs(selected, ['inner'])
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/tests/__init__.py
"The beautifulsoup tests."
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/build/lib/bs4/__init__.py
"""Beautiful Soup
Elixir and Tonic
"The Screen-Scraper's Friend"
http://www.crummy.com/software/BeautifulSoup/
Beautiful Soup uses a pluggable XML or HTML parser to parse a
(possibly invalid) document into a tree representation. Beautiful Soup
provides methods and Pythonic idioms that make it easy to
navigate, search, and modify the parse tree.
Beautiful Soup works with Python 2.6 and up. It works better if lxml
and/or html5lib is installed.
For more than you ever wanted to know about Beautiful Soup, see the
documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
"""
__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "4.1.0"
__copyright__ = "Copyright (c) 2004-2012 Leonard Richardson"
__license__ = "MIT"
__all__ = ['BeautifulSoup']
import re
import warnings
from .builder import builder_registry
from .dammit import UnicodeDammit
from .element import (
CData,
Comment,
DEFAULT_OUTPUT_ENCODING,
Declaration,
Doctype,
NavigableString,
PageElement,
ProcessingInstruction,
ResultSet,
SoupStrainer,
Tag,
)
# The very first thing we do is give a useful error if someone is
# running this code under Python 3 without converting it.
syntax_error = 'You are trying to run the Python 2 version of Beautiful Soup under Python 3. This will not work. You need to convert the code, either by installing it (`python setup.py install`) or by running 2to3 (`2to3 -w bs4`).'
class BeautifulSoup(Tag):
"""
This class defines the basic interface called by the tree builders.
These methods will be called by the parser:
reset()
feed(markup)
The tree builder may call these methods from its feed() implementation:
handle_starttag(name, attrs) # See note about return value
handle_endtag(name)
handle_data(data) # Appends to the current data node
endData(containerClass=NavigableString) # Ends the current data node
No matter how complicated the underlying parser is, you should be
able to build a tree using 'start tag' events, 'end tag' events,
'data' events, and "done with data" events.
If you encounter an empty-element tag (aka a self-closing tag,
like HTML's <br> tag), call handle_starttag and then
handle_endtag.
"""
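# Illustration (not part of the original source): for an empty-element
# tag like <br/>, a builder driving the interface above would make two
# calls, using the four-argument handle_starttag signature defined
# later in this class:
#
#   soup.handle_starttag('br', None, None, {})
#   soup.handle_endtag('br')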
ROOT_TAG_NAME = '[document]'
# If the end-user gives no indication which tree builder they
# want, look for one with these features.
DEFAULT_BUILDER_FEATURES = ['html', 'fast']
# Used when determining whether a text node is all whitespace and
# can be replaced with a single space. A text node that contains
# fancy Unicode spaces (usually non-breaking) should be left
# alone.
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None, }
def __init__(self, markup="", features=None, builder=None,
parse_only=None, from_encoding=None, **kwargs):
"""The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser."""
if 'convertEntities' in kwargs:
warnings.warn(
"BS4 does not respect the convertEntities argument to the "
"BeautifulSoup constructor. Entities are always converted "
"to Unicode characters.")
if 'markupMassage' in kwargs:
del kwargs['markupMassage']
warnings.warn(
"BS4 does not respect the markupMassage argument to the "
"BeautifulSoup constructor. The tree builder is responsible "
"for any necessary markup massage.")
if 'smartQuotesTo' in kwargs:
del kwargs['smartQuotesTo']
warnings.warn(
"BS4 does not respect the smartQuotesTo argument to the "
"BeautifulSoup constructor. Smart quotes are always converted "
"to Unicode characters.")
if 'selfClosingTags' in kwargs:
del kwargs['selfClosingTags']
warnings.warn(
"BS4 does not respect the selfClosingTags argument to the "
"BeautifulSoup constructor. The tree builder is responsible "
"for understanding self-closing tags.")
if 'isHTML' in kwargs:
del kwargs['isHTML']
warnings.warn(
"BS4 does not respect the isHTML argument to the "
"BeautifulSoup constructor. You can pass in features='html' "
"or features='xml' to get a builder capable of handling "
"one or the other.")
def deprecated_argument(old_name, new_name):
if old_name in kwargs:
warnings.warn(
'The "%s" argument to the BeautifulSoup constructor '
'has been renamed to "%s."' % (old_name, new_name))
value = kwargs[old_name]
del kwargs[old_name]
return value
return None
parse_only = parse_only or deprecated_argument(
"parseOnlyThese", "parse_only")
from_encoding = from_encoding or deprecated_argument(
"fromEncoding", "from_encoding")
if len(kwargs) > 0:
arg = list(kwargs.keys()).pop()
raise TypeError(
"__init__() got an unexpected keyword argument '%s'" % arg)
if builder is None:
if isinstance(features, str):
features = [features]
if features is None or len(features) == 0:
features = self.DEFAULT_BUILDER_FEATURES
builder_class = builder_registry.lookup(*features)
if builder_class is None:
raise ValueError(
"Couldn't find a tree builder with the features you "
"requested: %s. Do you need to install a parser library?"
% ",".join(features))
builder = builder_class()
self.builder = builder
self.is_xml = builder.is_xml
self.builder.soup = self
self.parse_only = parse_only
self.reset()
if hasattr(markup, 'read'): # It's a file-type object.
markup = markup.read()
(self.markup, self.original_encoding, self.declared_html_encoding,
self.contains_replacement_characters) = (
self.builder.prepare_markup(markup, from_encoding))
try:
self._feed()
except StopParsing:
pass
# Clear out the markup and remove the builder's circular
# reference to this object.
self.markup = None
self.builder.soup = None
def _feed(self):
# Convert the document to Unicode.
self.builder.reset()
self.builder.feed(self.markup)
# Close out any unfinished strings and close all the open tags.
self.endData()
while self.currentTag.name != self.ROOT_TAG_NAME:
self.popTag()
def reset(self):
Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
self.hidden = 1
self.builder.reset()
self.currentData = []
self.currentTag = None
self.tagStack = []
self.pushTag(self)
def new_tag(self, name, namespace=None, nsprefix=None, **attrs):
"""Create a new tag associated with this soup."""
return Tag(None, self.builder, name, namespace, nsprefix, attrs)
def new_string(self, s):
"""Create a new NavigableString associated with this soup."""
navigable = NavigableString(s)
navigable.setup()
return navigable
def insert_before(self, successor):
raise ValueError("BeautifulSoup objects don't support insert_before().")
def insert_after(self, successor):
raise ValueError("BeautifulSoup objects don't support insert_after().")
def popTag(self):
tag = self.tagStack.pop()
#print "Pop", tag.name
if self.tagStack:
self.currentTag = self.tagStack[-1]
return self.currentTag
def pushTag(self, tag):
#print "Push", tag.name
if self.currentTag:
self.currentTag.contents.append(tag)
self.tagStack.append(tag)
self.currentTag = self.tagStack[-1]
def endData(self, containerClass=NavigableString):
if self.currentData:
currentData = ''.join(self.currentData)
if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
not set([tag.name for tag in self.tagStack]).intersection(
self.builder.preserve_whitespace_tags)):
if '\n' in currentData:
currentData = '\n'
else:
currentData = ' '
self.currentData = []
if self.parse_only and len(self.tagStack) <= 1 and \
(not self.parse_only.text or \
not self.parse_only.search(currentData)):
return
o = containerClass(currentData)
self.object_was_parsed(o)
def object_was_parsed(self, o):
"""Add an object to the parse tree."""
o.setup(self.currentTag, self.previous_element)
if self.previous_element:
self.previous_element.next_element = o
self.previous_element = o
self.currentTag.contents.append(o)
def _popToTag(self, name, nsprefix=None, inclusivePop=True):
"""Pops the tag stack up to and including the most recent
instance of the given tag. If inclusivePop is false, pops the tag
stack up to but *not* including the most recent instance of
the given tag."""
#print "Popping to %s" % name
if name == self.ROOT_TAG_NAME:
return
numPops = 0
mostRecentTag = None
for i in range(len(self.tagStack) - 1, 0, -1):
if (name == self.tagStack[i].name
and nsprefix == self.tagStack[i].nsprefix):
numPops = len(self.tagStack) - i
break
if not inclusivePop:
numPops = numPops - 1
for i in range(0, numPops):
mostRecentTag = self.popTag()
return mostRecentTag
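# Illustration (not in the original source): with a tag stack of
# [document, html, body, b], _popToTag('body') pops <b> and then
# <body>, leaving [document, html]; with inclusivePop=False it
# would pop only <b>.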
def handle_starttag(self, name, namespace, nsprefix, attrs):
"""Push a start tag on to the stack.
If this method returns None, the tag was rejected by the
SoupStrainer. You should proceed as if the tag had not occurred
in the document. For instance, if this was a self-closing tag,
don't call handle_endtag.
"""
# print "Start tag %s: %s" % (name, attrs)
self.endData()
if (self.parse_only and len(self.tagStack) <= 1
and (self.parse_only.text
or not self.parse_only.search_tag(name, attrs))):
return None
tag = Tag(self, self.builder, name, namespace, nsprefix, attrs,
self.currentTag, self.previous_element)
if tag is None:
return tag
if self.previous_element:
self.previous_element.next_element = tag
self.previous_element = tag
self.pushTag(tag)
return tag
def handle_endtag(self, name, nsprefix=None):
#print "End tag: " + name
self.endData()
self._popToTag(name, nsprefix)
def handle_data(self, data):
self.currentData.append(data)
def decode(self, pretty_print=False,
eventual_encoding=DEFAULT_OUTPUT_ENCODING,
formatter="minimal"):
"""Returns a string or Unicode representation of this document.
To get Unicode, pass None for encoding."""
if self.is_xml:
# Print the XML declaration
encoding_part = ''
if eventual_encoding != None:
encoding_part = ' encoding="%s"' % eventual_encoding
prefix = '<?xml version="1.0"%s?>\n' % encoding_part
else:
prefix = ''
if not pretty_print:
indent_level = None
else:
indent_level = 0
return prefix + super(BeautifulSoup, self).decode(
indent_level, eventual_encoding, formatter)
class BeautifulStoneSoup(BeautifulSoup):
"""Deprecated interface to an XML parser."""
def __init__(self, *args, **kwargs):
kwargs['features'] = 'xml'
warnings.warn(
'The BeautifulStoneSoup class is deprecated. Instead of using '
'it, pass features="xml" into the BeautifulSoup constructor.')
super(BeautifulStoneSoup, self).__init__(*args, **kwargs)
class StopParsing(Exception):
pass
#By default, act as an HTML pretty-printer.
if __name__ == '__main__':
import sys
soup = BeautifulSoup(sys.stdin)
print(soup.prettify())
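As an aside, here is a minimal sketch (not part of the tarball) of the deprecated-argument handling shown above in action; the BS3-era keyword parseOnlyThese is mapped to parse_only after a rename warning:

import warnings
from bs4 import BeautifulSoup, SoupStrainer

only_b = SoupStrainer("b")
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    # The old BS3 name still works, but triggers a rename warning
    # and is forwarded to parse_only internally.
    soup = BeautifulSoup("<a>foo</a><b>bar</b>", parseOnlyThese=only_b)
print(soup.decode())  # only the <b> element should survive the strainer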
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/COPYING.txt
Beautiful Soup is made available under the MIT license:
Copyright (c) 2004-2012 Leonard Richardson
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE, DAMMIT.
Beautiful Soup incorporates code from the html5lib library, which is
also made available under the MIT license.
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/doc/Makefile
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = build
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo "  html       to make standalone HTML files"
@echo "  dirhtml    to make HTML files named index.html in directories"
@echo "  singlehtml to make a single large HTML file"
@echo "  pickle     to make pickle files"
@echo "  json       to make JSON files"
@echo "  htmlhelp   to make HTML files and a HTML help project"
@echo "  qthelp     to make HTML files and a qthelp project"
@echo "  devhelp    to make HTML files and a Devhelp project"
@echo "  epub       to make an epub"
@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
@echo "  text       to make text files"
@echo "  man        to make manual pages"
@echo "  changes    to make an overview of all changed/added/deprecated items"
@echo "  linkcheck  to check all external links for integrity"
@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
clean:
-rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/BeautifulSoup.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/BeautifulSoup.qhc"
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/BeautifulSoup"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/BeautifulSoup"
@echo "# devhelp"
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
make -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/doc/source/6.1
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/doc/source/conf.py
# -*- coding: utf-8 -*-
#
# Beautiful Soup documentation build configuration file, created by
# sphinx-quickstart on Thu Jan 26 11:22:55 2012.
#
# This file is execfile()d with the current directory set to its containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import sys, os
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath(‘.’))
# -- General configuration -----------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = ‘1.0’
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named ‘sphinx.ext.*’) or your custom ones.
extensions = []
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'Beautiful Soup'
copyright = u'2012, Leonard Richardson'
# The version info for the project you’re documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '4'
# The full version, including alpha/beta/rc tags.
release = '4.0.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []
# The reST default role (used for this markup: `text`) to use for all documents.
#default_role = None
# If true, ‘()’ will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# -- Options for HTML output ---------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'default'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16×16 or 32×32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named “default.css” will overwrite the builtin “default.css”.
html_static_path = ['_static']
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, “Created using Sphinx” is shown in the HTML footer. Default is True.
#html_show_sphinx = True
# If true, “(C) Copyright …” is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Output file base name for HTML help builder.
htmlhelp_basename = 'BeautifulSoupdoc'
# -- Options for LaTeX output --------------------------------------------
# The paper size ('letter' or 'a4').
#latex_paper_size = 'letter'
# The font size ('10pt', '11pt' or '12pt').
#latex_font_size = '10pt'
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass [howto/manual]).
latex_documents = [
('index', 'BeautifulSoup.tex', u'Beautiful Soup Documentation',
u'Leonard Richardson', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# For “manual” documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Additional stuff for the LaTeX preamble.
#latex_preamble = ''
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output --------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'beautifulsoup', u'Beautiful Soup Documentation',
[u'Leonard Richardson'], 1)
]
# -- Options for Epub output ---------------------------------------------
# Bibliographic Dublin Core info.
epub_title = u'Beautiful Soup'
epub_author = u'Leonard Richardson'
epub_publisher = u'Leonard Richardson'
epub_copyright = u'2012, Leonard Richardson'
# The language of the text. It defaults to the language option
# or en if the language is not set.
#epub_language = ''
# The scheme of the identifier. Typical schemes are ISBN or URL.
#epub_scheme = ''
# The unique identifier of the text. This can be an ISBN number
# or the project homepage.
#epub_identifier = ''
# A unique identification for the text.
#epub_uid = ''
# HTML files that should be inserted before the pages created by sphinx.
# The format is a list of tuples containing the path and title.
#epub_pre_files = []
# HTML files that should be inserted after the pages created by sphinx.
# The format is a list of tuples containing the path and title.
#epub_post_files = []
# A list of files that should not be packed into the epub file.
#epub_exclude_files = []
# The depth of the table of contents in toc.ncx.
#epub_tocdepth = 3
# Allow duplicate toc entries.
#epub_tocdup = True
beautifulsoup4-4.1.0.tar/dist/beautifulsoup4-4.1.0/doc/source/index.rst
Beautiful Soup Documentation
============================
.. image:: 6.1
:align: right
:alt: “The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself.”
`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.
These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.
The examples in this documentation should work the same way in Python
2.7 and Python 3.2.
You might be looking for the documentation for `Beautiful Soup 3
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If
you want to learn about the differences between Beautiful Soup 3 and
Beautiful Soup 4, see `Porting code to BS4`_.
Getting help
------------
If you have questions about Beautiful Soup, or run into problems,
`send mail to the discussion group
<https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_.
Quick Start
===========
Here’s an HTML document I’ll be using as an example throughout this
document. It’s part of a story from `Alice in Wonderland`::
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure::
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
One common task is extracting all the URLs found within a page's <a> tags::
for link in soup.find_all('a'):
print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Another common task is extracting all the text from a page::
print(soup.get_text())
# The Dormouse’s story
#
# The Dormouse’s story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# …
Does this look like what you need? If so, read on.
Installing Beautiful Soup
=========================
If you’re using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:
:kbd:`$ apt-get install python-beautifulsoup4`
Beautiful Soup 4 is published through PyPi, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``. The package name is ``beautifulsoup4``, and the same package
works on Python 2 and Python 3.
:kbd:`$ easy_install beautifulsoup4`
:kbd:`$ pip install beautifulsoup4`
(The ``BeautifulSoup`` package is probably `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)
If you don't have ``easy_install`` or ``pip`` installed, you can
`download the Beautiful Soup 4 source tarball
<http://www.crummy.com/software/BeautifulSoup/bs4/download/>`_ and
install it with ``setup.py``.
:kbd:`$ python setup.py install`
If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application. You can download the
tarball, copy its ``bs4`` directory into your application's codebase,
and use Beautiful Soup without installing it at all.
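A minimal sketch of that vendoring approach (the directory name here is
hypothetical)::

    import sys
    sys.path.insert(0, "myapp/vendor")  # directory that contains the copied bs4/
    from bs4 import BeautifulSoup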
I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.
Problems after installation
---------------------------
Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it’s automatically converted to Python 3 code. If
you don’t install the package, the code won’t be converted. There have
also been reports on Windows machines of the wrong version being
installed.
If you get the ``ImportError`` "No module named HTMLParser", your
problem is that you're running the Python 2 version of the code under
Python 3.
If you get the ``ImportError`` "No module named html.parser", your
problem is that you're running the Python 3 version of the code under
Python 2.
In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.
If you get the ``SyntaxError`` "Invalid syntax" on the line
``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2
code to Python 3. You can do this either by installing the package:
:kbd:`$ python3 setup.py install`
or by manually running Python's ``2to3`` conversion script on the
``bs4`` directory:
:kbd:`$ 2to3-3.2 -w bs4`
.. _parser-installation:
Installing a parser
-------------------
Beautiful Soup supports the HTML parser included in Python's standard
library, but it also supports a number of third-party Python parsers.
One is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
you might install lxml with one of these commands:
:kbd:`$ apt-get install python-lxml`
:kbd:`$ easy_install lxml`
:kbd:`$ pip install lxml`
If you're using Python 2, another alternative is the pure-Python
`html5lib parser <http://code.google.com/p/html5lib/>`_, which parses
HTML the way a web browser does. Depending on your setup, you might
install html5lib with one of these commands:
:kbd:`$ apt-get install python-html5lib`
:kbd:`$ easy_install html5lib`
:kbd:`$ pip install html5lib`
This table summarizes the advantages and disadvantages of each parser library:
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Parser               | Typical usage                              | Advantages                     | Disadvantages            |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
|                      |                                            | * Decent speed                 |   (before Python 2.7.3  |
|                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
|                      |                                            |   and 3.2.)                    |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
|                      |                                            | * Lenient                      |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's XML parser    | ``BeautifulSoup(markup, ["lxml", "xml"])`` | * Very fast                    | * External C dependency  |
|                      | ``BeautifulSoup(markup, "xml")``           | * The only currently supported |                          |
|                      |                                            |   XML parser                   |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| html5lib             | ``BeautifulSoup(markup, "html5lib")``      | * Extremely lenient            | * Very slow              |
|                      |                                            | * Parses pages the same way a  | * External Python        |
|                      |                                            |   web browser does             |   dependency             |
|                      |                                            | * Creates valid HTML5          | * Python 2 only          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
If you can, I recommend you install and use lxml for speed. If you're
using a version of Python 2 earlier than 2.7.3, or a version of Python
3 earlier than 3.2.2, it's `essential` that you install lxml or
html5lib--Python's built-in HTML parser is just not very good in older
versions.
Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
between parsers`_ for details.
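For example (a quick sketch, not one of this document's original
examples; exact output can vary with the installed parser versions),
the same invalid fragment produces different trees::

    BeautifulSoup("<a></p>", "lxml")
    # <html><body><a></a></body></html>

    BeautifulSoup("<a></p>", "html.parser")
    # <a></a>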
Making the soup
===============
To parse a document, pass it into the ``BeautifulSoup``
constructor. You can pass in a string or an open filehandle::

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Parsing XML`_.)
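For instance, passing "xml" selects lxml's XML parser and adds an XML
declaration on output (a short sketch; this requires lxml to be
installed)::

    soup = BeautifulSoup("<tag>data</tag>", "xml")
    print(soup)
    # <?xml version="1.0" encoding="utf-8"?>
    # <tag>data</tag>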
Kinds of objects
================
Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you’ll only ever have to deal with about four
`kinds` of objects.
.. _Tag:
``Tag``
-------
A ``Tag`` object corresponds to an XML or HTML tag in the original document::

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Tags have a lot of attributes and methods, and I’ll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.
Name
^^^^
Every tag has a name, accessible as ``.name``::

tag.name
# u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
Attributes
^^^^^^^^^^
A tag may have any number of attributes. The tag ``<b class="boldest">``
has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

tag['class']
# u'boldest'

You can access that dictionary directly as ``.attrs``::

tag.attrs
# {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'

print(tag.get('class'))
# None
.. _multivalue:
Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes::

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'
``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

 tag.string
 # u'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
``NavigableString`` to a Unicode string with ``unicode()``::

 unicode_string = unicode(tag.string)
 unicode_string
 # u'Extremely bold'
 type(unicode_string)
 # <type 'unicode'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with`::

 tag.string.replace_with("No longer bold")
 tag
 # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.
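
A quick sketch of that asymmetry, continuing with the ``tag`` from
above: a string knows its parent, but has no children of its own::

 tag.string.parent
 # <blockquote>No longer bold</blockquote>
 tag.string.contents
 # AttributeError: 'NavigableString' object has no attribute 'contents'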
``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

 soup.name
 # u'[document]'
Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::

 markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
 soup = BeautifulSoup(markup)
 comment = soup.b.string
 type(comment)
 # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

 comment
 # u'Hey, buddy. Want to buy a used parser'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

 print(soup.b.prettify())
 # <b>
 #  <!--Hey, buddy. Want to buy a used parser?-->
 # </b>

Beautiful Soup defines classes for anything else that might show up in
an XML document: ``CData``, ``ProcessingInstruction``,
``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
are subclasses of ``NavigableString`` that add something extra to the
string. Here's an example that replaces the comment with a CDATA
block::

 from bs4 import CData
 cdata = CData("A CDATA block")
 comment.replace_with(cdata)

 print(soup.b.prettify())
 # <b>
 #  <![CDATA[A CDATA block]]>
 # </b>
Navigating the tree
===================
Here's the "three sisters" HTML document again::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)
I’ll use this as an example to show you how to move from one part of
a document to another.
Going down
———-
Tags may contain strings and other tags. These elements are the tag’s
`children`. Beautiful Soup provides a lot of different attributes for
navigating and iterating over a tag’s children.
Note that Beautiful Soup strings don’t support any of these
attributes, because a string can’t have children.
Navigating using tag names
^^^^^^^^^^^^^^^^^^^^^^^^^^
The simplest way to navigate the parse tree is to say the name of the
tag you want. If you want the <head> tag, just say ``soup.head``::

 soup.head
 # <head><title>The Dormouse's story</title></head>

 soup.title
 # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part
of the parse tree. This code gets the first <b> tag beneath the <body> tag::

 soup.body.b
 # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the `first` tag by that
name::

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get `all` the <a> tags, or anything more complicated
than the first tag with a certain name, you'll need to use one of the
methods described in `Searching the tree`_, such as `find_all()`::

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
``.contents`` and ``.children``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A tag's children are available in a list called ``.contents``::

 head_tag = soup.head
 head_tag
 # <head><title>The Dormouse's story</title></head>

 head_tag.contents
 # [<title>The Dormouse's story</title>]

 title_tag = head_tag.contents[0]
 title_tag
 # <title>The Dormouse's story</title>
 title_tag.contents
 # [u'The Dormouse's story']

The ``BeautifulSoup`` object itself has children. In this case, the
<html> tag is the child of the ``BeautifulSoup`` object::

 len(soup.contents)
 # 1
 soup.contents[0].name
 # u'html'

A string does not have ``.contents``, because it can't contain
anything::

 text = title_tag.contents[0]
 text.contents
 # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's
children using the ``.children`` generator::

 for child in title_tag.children:
     print(child)
 # The Dormouse's story
``.descendants``
^^^^^^^^^^^^^^^^

The ``.contents`` and ``.children`` attributes only consider a tag's
`direct` children. For instance, the <head> tag has a single direct
child--the <title> tag::

 head_tag.contents
 # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's
story". There's a sense in which that string is also a child of the
<head> tag. The ``.descendants`` attribute lets you iterate over `all`
of a tag's children, recursively: its direct children, the children of
its direct children, and so on::

 for child in head_tag.descendants:
     print(child)
 # <title>The Dormouse's story</title>
 # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the
<title> tag and the <title> tag's child. The ``BeautifulSoup`` object
only has one direct child (the <html> tag), but it has a whole lot of
descendants::

 len(list(soup.children))
 # 1
 len(list(soup.descendants))
 # 25
.. _.string:
``.string``
^^^^^^^^^^^

If a tag has only one child, and that child is a ``NavigableString``,
the child is made available as ``.string``::

 title_tag.string
 # u'The Dormouse's story'

If a tag's only child is another tag, and `that` tag has a
``.string``, then the parent tag is considered to have the same
``.string`` as its child::

 head_tag.contents
 # [<title>The Dormouse's story</title>]

 head_tag.string
 # u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what
``.string`` should refer to, so ``.string`` is defined to be
``None``::

 print(soup.html.string)
 # None
.. _string-generators:
``.strings`` and ``stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there's more than one thing inside a tag, you can still look at
just the strings. Use the ``.strings`` generator::

 for string in soup.strings:
     print(repr(string))
 # u"The Dormouse's story"
 # u'\n\n'
 # u"The Dormouse's story"
 # u'\n\n'
 # u'Once upon a time there were three little sisters; and their names were\n'
 # u'Elsie'
 # u',\n'
 # u'Lacie'
 # u' and\n'
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # u'...'
 # u'\n'

These strings tend to have a lot of extra whitespace, which you can
remove by using the ``.stripped_strings`` generator instead::

 for string in soup.stripped_strings:
     print(repr(string))
 # u"The Dormouse's story"
 # u"The Dormouse's story"
 # u'Once upon a time there were three little sisters; and their names were'
 # u'Elsie'
 # u','
 # u'Lacie'
 # u'and'
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'...'

Here, strings consisting entirely of whitespace are ignored, and
whitespace at the beginning and end of strings is removed.
Going up
——–
Continuing the “family tree” analogy, every tag and every string has a
`parent`: the tag that contains it.
.. _.parent:
``.parent``
^^^^^^^^^^^

You can access an element's parent with the ``.parent`` attribute. In
the example "three sisters" document, the <head> tag is the parent
of the <title> tag::

 title_tag.parent
 # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains
it::

 title_tag.string.parent
 # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the ``BeautifulSoup`` object
itself::

 html_tag = soup.html
 type(html_tag.parent)
 # <class 'bs4.BeautifulSoup'>

And the ``.parent`` of a ``BeautifulSoup`` object is defined as None::

 print(soup.parent)
 # None
.. _.parents:
``.parents``
^^^^^^^^^^^^

You can iterate over all of an element's parents with
``.parents``. This example uses ``.parents`` to travel from an <a> tag
buried deep within the document, to the very top of the document::

 link = soup.a
 link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 for parent in link.parents:
     if parent is None:
         print(parent)
     else:
         print(parent.name)
 # p
 # body
 # html
 # [document]
 # None
Going sideways
————–
Consider a simple document like this::

 sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
 print(sibling_soup.prettify())
 # <html>
 #  <body>
 #   <a>
 #    <b>
 #     text1
 #    </b>
 #    <c>
 #     text2
 #    </c>
 #   </a>
 #  </body>
 # </html>

The <b> tag and the <c> tag are at the same level: they're both direct
children of the same tag. We call them `siblings`. When a document is
pretty-printed, siblings show up at the same indentation level. You
can also use this relationship in the code you write.
``.next_sibling`` and ``.previous_sibling``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
between page elements that are on the same level of the parse tree::

 sibling_soup.b.next_sibling
 # <c>text2</c>

 sibling_soup.c.previous_sibling
 # <b>text1</b>

The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``,
because there's nothing before the <b> tag `on the same level of the
tree`. For the same reason, the <c> tag has a ``.previous_sibling``
but no ``.next_sibling``::

 print(sibling_soup.b.previous_sibling)
 # None
 print(sibling_soup.c.next_sibling)
 # None

The strings "text1" and "text2" are `not` siblings, because they don't
have the same parent::

 sibling_soup.b.string
 # u'text1'

 print(sibling_soup.b.string.next_sibling)
 # None
In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
tag will usually be a string containing whitespace. Going back to the
"three sisters" document::

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the ``.next_sibling`` of the first <a> tag would
be the second <a> tag. But actually, it's a string: the comma and
newline that separate the first <a> tag from the second::

 link = soup.a
 link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 link.next_sibling
 # u',\n'

The second <a> tag is actually the ``.next_sibling`` of the comma::

 link.next_sibling.next_sibling
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
.. _sibling-generators:
``.next_siblings`` and ``.previous_siblings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can iterate over a tag's siblings with ``.next_siblings`` or
``.previous_siblings``::

 for sibling in soup.a.next_siblings:
     print(repr(sibling))
 # u',\n'
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 # u' and\n'
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 # u'; and they lived at the bottom of a well.'
 # None

 for sibling in soup.find(id="link3").previous_siblings:
     print(repr(sibling))
 # ' and\n'
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 # u',\n'
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 # u'Once upon a time there were three little sisters; and their names were\n'
 # None
Going back and forth
——————–
Take a look at the beginning of the "three sisters" document::

 <html><head><title>The Dormouse's story</title></head>

An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
<title> tag", and so on. Beautiful Soup offers tools for reconstructing the
initial parse of the document.
.. _element-generators:
``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::

 last_a_tag = soup.find("a", id="link3")
 last_a_tag
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_a_tag.next_sibling
 # '; and they lived at the bottom of a well.'

But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::

 last_a_tag.next_element
 # u'Tillie'

That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest of
the sentence. The semicolon is on the same level as the <a> tag, but the
word "Tillie" was encountered first.

The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::

 last_a_tag.previous_element
 # u' and\n'
 last_a_tag.previous_element.next_element
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::

 for element in last_a_tag.next_elements:
     print(repr(element))
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # <p class="story">...</p>
 # u'...'
 # u'\n'
 # None
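
For symmetry, here's a sketch of ``.previous_elements`` run from the
same tag (output abbreviated; the exact strings depend on the parse)::

 for element in last_a_tag.previous_elements:
     print(repr(element))
 # u' and\n'
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 # u',\n'
 # ...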
Searching the tree
==================
Beautiful Soup defines a lot of methods for searching the parse tree,
but they're all very similar. I'm going to spend a lot of time explaining
the two most popular methods: ``find()`` and ``find_all()``. The other
methods take almost exactly the same arguments, so I'll just cover
them briefly.

Once again, I'll be using the "three sisters" document as an example::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

By passing in a filter to an argument like ``find_all()``, you can
isolate whatever parts of the document you're interested in.
Kinds of filters
—————-
Before talking in detail about ``find_all()`` and similar methods, I
want to show examples of different filters you can pass into these
methods. These filters show up again and again, throughout the
search API. You can use them to filter based on a tag's name,
on its attributes, on the text of a string, or on some combination of
these.

.. _a string:

A string
^^^^^^^^

The simplest filter is a string. Pass a string to a search method and
Beautiful Soup will perform a match against that exact string. This
code finds all the <b> tags in the document::

 soup.find_all('b')
 # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is
encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
.. _a regular expression:
A regular expression
^^^^^^^^^^^^^^^^^^^^
If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression. This code finds all the tags whose
names start with the letter "b"; in this case, the <body> tag and the
<b> tag::

 import re
 for tag in soup.find_all(re.compile("b.*")):
     print(tag.name)
 # body
 # b
.. _a list:
A list
^^^^^^
If you pass in a list, Beautiful Soup will allow a string match
against `any` item in that list. This code finds all the <a> tags
`and` all the <b> tags::

 soup.find_all(["a", "b"])
 # [<b>The Dormouse's story</b>,
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
.. _the value True:
``True``
^^^^^^^^

The value ``True`` matches everything it can. This code finds `all`
the tags in the document, but none of the text strings::

 for tag in soup.find_all(True):
     print(tag.name)
 # html
 # head
 # title
 # body
 # p
 # b
 # p
 # a
 # a
 # a
 # p
.. _a function:
A function
^^^^^^^^^^
If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return
``True`` if the argument matches, and ``False`` otherwise.

Here's a function that returns ``True`` if a tag defines the "class"
attribute but doesn't define the "id" attribute::

 def has_class_but_no_id(tag):
     return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the
<p> tags::

 soup.find_all(has_class_but_no_id)
 # [<p class="title"><b>The Dormouse's story</b></p>,
 #  <p class="story">Once upon a time there were...</p>,
 #  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".

``find_all()``
--------------

Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. I gave several
examples in `Kinds of filters`_, but here are a few more::

 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

 soup.find_all("p", "title")
 # [<p class="title"><b>The Dormouse's story</b></p>]

 soup.find_all("a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find_all(id="link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 import re
 soup.find(text=re.compile("sisters"))
 # u'Once upon a time there were three little sisters; and their names were\n'
Some of these should look familiar, but others are new. What does it
mean to pass in a value for ``text``, or ``id``? Why does
``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.
.. _name:
The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^

Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.

This is the simplest usage::

 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

Recall from `Kinds of filters`_ that the value to ``name`` can be `a
string`_, `a regular expression`_, `a list`_, `a function`_, or `the value
True`_.
.. _kwargs:
The keyword arguments
^^^^^^^^^^^^^^^^^^^^^

Any argument that's not recognized will be turned into a filter on one
of a tag's attributes. If you pass in a value for an argument called ``id``,
Beautiful Soup will filter against each tag's 'id' attribute::

 soup.find_all(id='link2')
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for ``href``, Beautiful Soup will filter
against each tag's 'href' attribute::

 soup.find_all(href=re.compile("elsie"))
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on `a string`_, `a regular
expression`_, `a list`_, `a function`_, or `the value True`_.

This code finds all tags that have an ``id`` attribute, regardless of
what the value is::

 soup.find_all(id=True)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one
keyword argument::

 soup.find_all(href=re.compile("elsie"), id='link1')
 # [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]
.. _attrs:
Searching by CSS class
^^^^^^^^^^^^^^^^^^^^^^

Instead of using keyword arguments, you can filter tags based on their
attributes by passing a dictionary in for ``attrs``. These two lines of
code are equivalent::

 soup.find_all(href=re.compile("elsie"), id='link1')
 soup.find_all(attrs={'href': re.compile("elsie"), 'id': 'link1'})

The ``attrs`` argument would be a pretty obscure feature were it not for
one thing: CSS. It's very useful to search for a tag that has a
certain CSS class, but the name of the CSS attribute, "class", is also a
Python reserved word.

You can use ``attrs`` to search by CSS class::

 soup.find_all("a", {"class": "sister"})
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

But that's a lot of code for such a common operation. Instead, you can
pass a string for ``attrs`` instead of a dictionary. The string will be used
to restrict the CSS class::

 soup.find_all("a", "sister")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can also pass in a regular expression, a function, or
``True``. Anything you pass in for ``attrs`` that's not a dictionary will
be used to search against the CSS class::

 soup.find_all(attrs=re.compile("itl"))
 # [<p class="title"><b>The Dormouse's story</b></p>]

 def has_six_characters(css_class):
     return css_class is not None and len(css_class) == 6

 soup.find_all(attrs=has_six_characters)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
:ref:`Remember <multivalue>` that a single tag can have multiple
values for its "class" attribute. When you search for a tag that
matches a certain CSS class, you're matching against `any` of its CSS
classes::

 css_soup = BeautifulSoup('<p class="body strikeout"></p>')
 css_soup.find_all("p", "strikeout")
 # [<p class="body strikeout"></p>]

 css_soup.find_all("p", "body")
 # [<p class="body strikeout"></p>]

Searching for the string value of the ``class`` attribute won't work::

 css_soup.find_all("p", "body strikeout")
 # []
.. _text:
The ``text`` argument
^^^^^^^^^^^^^^^^^^^^^

With ``text`` you can search for strings instead of tags. As with
``name`` and the keyword arguments, you can pass in `a string`_, `a
regular expression`_, `a list`_, `a function`_, or `the value True`_.
Here are some examples::

 soup.find_all(text="Elsie")
 # [u'Elsie']

 soup.find_all(text=["Tillie", "Elsie", "Lacie"])
 # [u'Elsie', u'Lacie', u'Tillie']

 soup.find_all(text=re.compile("Dormouse"))
 # [u"The Dormouse's story", u"The Dormouse's story"]

 def is_the_only_string_within_a_tag(s):
     """Return True if this string is the only child of its parent tag."""
     return (s == s.parent.string)

 soup.find_all(text=is_the_only_string_within_a_tag)
 # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

Although ``text`` is for finding strings, you can combine it with
arguments for finding tags: Beautiful Soup will find all tags whose
``.string`` matches your value for ``text``. This code finds the <a>
tags whose ``.string`` is "Elsie"::

 soup.find_all("a", text="Elsie")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
.. _limit:
The ``limit`` argument
^^^^^^^^^^^^^^^^^^^^^^

``find_all()`` returns all the tags and strings that match your
filters. This can take a while if the document is large. If you don't
need `all` the results, you can pass in a number for ``limit``. This
works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
stop gathering results after it's found a certain number.

There are three links in the "three sisters" document, but this code
only finds the first two::

 soup.find_all("a", limit=2)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
.. _recursive:
The ``recursive`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

If you call ``mytag.find_all()``, Beautiful Soup will examine all the
descendants of ``mytag``: its children, its children's children, and
so on. If you only want Beautiful Soup to consider direct children,
you can pass in ``recursive=False``. See the difference here::

 soup.html.find_all("title")
 # [<title>The Dormouse's story</title>]

 soup.html.find_all("title", recursive=False)
 # []

Here's that part of the document::

 <html>
  <head>
   <title>
    The Dormouse's story
   </title>
  </head>
 ...

The <title> tag is beneath the <html> tag, but it's not `directly`
beneath the <html> tag: the <head> tag is in the way. Beautiful Soup
finds the <title> tag when it's allowed to look at all descendants of
the <html> tag, but when ``recursive=False`` restricts it to the <html>
tag's immediate children, it finds nothing.
Beautiful Soup offers a lot of tree-searching methods (covered below),
and they mostly take the same arguments as ``find_all()``: ``name``,
``attrs``, ``text``, ``limit``, and the keyword arguments. But the
``recursive`` argument is different: ``find_all()`` and ``find()`` are
the only methods that support it. Passing ``recursive=False`` into a
method like ``find_parents()`` wouldn't be very useful.
Calling a tag is like calling ``find_all()``
--------------------------------------------

Because ``find_all()`` is the most popular method in the Beautiful
Soup search API, you can use a shortcut for it. If you treat the
``BeautifulSoup`` object or a ``Tag`` object as though it were a
function, then it's the same as calling ``find_all()`` on that
object. These two lines of code are equivalent::

 soup.find_all("a")
 soup("a")

These two lines are also equivalent::

 soup.title.find_all(text=True)
 soup.title(text=True)
``find()``
----------

Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method scans the entire document looking for
results, but sometimes you only want to find one result. If you know a
document only has one <body> tag, it's a waste of time to scan the
entire document looking for more. Rather than passing in ``limit=1``
every time you call ``find_all``, you can use the ``find()``
method. These two lines of code are `nearly` equivalent::

 soup.find_all('title', limit=1)
 # [<title>The Dormouse's story</title>]

 soup.find('title')
 # <title>The Dormouse's story</title>

The only difference is that ``find_all()`` returns a list containing
the single result, and ``find()`` just returns the result.

If ``find_all()`` can't find anything, it returns an empty list. If
``find()`` can't find anything, it returns ``None``::

 print(soup.find("nosuchtag"))
 # None

Remember the ``soup.head.title`` trick from `Navigating using tag
names`_? That trick works by repeatedly calling ``find()``::

 soup.head.title
 # <title>The Dormouse's story</title>

 soup.find("head").find("title")
 # <title>The Dormouse's story</title>
``find_parents()`` and ``find_parent()``
----------------------------------------

Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

I spent a lot of time above covering ``find_all()`` and
``find()``. The Beautiful Soup API defines ten other methods for
searching the tree, but don't be afraid. Five of these methods are
basically the same as ``find_all()``, and the other five are basically
the same as ``find()``. The only differences are in what parts of the
tree they search.

First let's consider ``find_parents()`` and
``find_parent()``. Remember that ``find_all()`` and ``find()`` work
their way down the tree, looking at a tag's descendants. These methods
do the opposite: they work their way `up` the tree, looking at a tag's
(or a string's) parents. Let's try them out, starting from a string
buried deep in the "three sisters" document::

 a_string = soup.find(text="Lacie")
 a_string
 # u'Lacie'

 a_string.find_parents("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 a_string.find_parent("p")
 # <p class="story">Once upon a time there were three little sisters; and their names were
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 #  and they lived at the bottom of a well.</p>

 a_string.find_parents("p", "title")
 # []

One of the three <a> tags is the direct parent of the string in
question, so our search finds it. One of the three <p> tags is an
indirect parent of the string, and our search finds that as
well. There's a <p> tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.

You may have made the connection between ``find_parent()`` and
``find_parents()``, and the `.parent`_ and `.parents`_ attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.
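
In other words, ``find_parents("a")`` behaves roughly like this
hand-rolled loop (a sketch, not the library's actual code; the helper
name is made up for illustration)::

 def find_parents_by_name(element, name):
     # Roughly what find_parents(name) does: walk .parents and filter.
     return [parent for parent in element.parents
             if parent.name == name]

 find_parents_by_name(a_string, "a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]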
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------

Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_siblings <sibling-generators>` to
iterate over the rest of an element's siblings in the tree. The
``find_next_siblings()`` method returns all the siblings that match,
and ``find_next_sibling()`` only returns the first one::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_next_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 first_story_paragraph = soup.find("p", "story")
 first_story_paragraph.find_next_sibling("p")
 # <p class="story">...</p>
``find_previous_siblings()`` and ``find_previous_sibling()``
------------------------------------------------------------

Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's
siblings that precede it in the tree. The ``find_previous_siblings()``
method returns all the siblings that match, and
``find_previous_sibling()`` only returns the first one::

 last_link = soup.find("a", id="link3")
 last_link
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_link.find_previous_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 first_story_paragraph = soup.find("p", "story")
 first_story_paragraph.find_previous_sibling("p")
 # <p class="title"><b>The Dormouse's story</b></p>
``find_all_next()`` and ``find_next()``
---------------------------------------

Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_elements <element-generators>` to
iterate over whatever tags and strings come after an element in the
document. The ``find_all_next()`` method returns all matches, and
``find_next()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_next(text=True)
 # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

 first_link.find_next("p")
 # <p class="story">...</p>

In the first example, the string "Elsie" showed up, even though it was
contained within the <a> tag we started from. In the second example,
the last <p> tag in the document showed up, even though it's not in
the same part of the tree as the <a> tag we started from. For these
methods, all that matters is that an element match the filter and
show up later in the document than the starting element.
``find_all_previous()`` and ``find_previous()``
-----------------------------------------------

Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_elements <element-generators>` to
iterate over the tags and strings that came before an element in the
document. The ``find_all_previous()`` method returns all matches, and
``find_previous()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_previous("p")
 # [<p class="story">Once upon a time there were three little sisters; ...</p>,
 #  <p class="title"><b>The Dormouse's story</b></p>]

 first_link.find_previous("title")
 # <title>The Dormouse's story</title>

The call to ``find_all_previous("p")`` found the first paragraph in
the document (the one with class="title"), but it also finds the
second paragraph, the <p> tag that contains the <a> tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
<p> tag that contains an <a> tag must have shown up before the <a>
tag it contains.
CSS selectors
-------------

Beautiful Soup supports a subset of the `CSS selector standard
<http://www.w3.org/TR/CSS2/selector.html>`_. Just construct the
selector as a string and pass it into the ``.select()`` method of a
``Tag`` or the ``BeautifulSoup`` object itself.

You can find tags::

 soup.select("title")
 # [<title>The Dormouse's story</title>]

Find tags beneath other tags::

 soup.select("body a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("html head title")
 # [<title>The Dormouse's story</title>]

Find tags `directly` beneath other tags::

 soup.select("head > title")
 # [<title>The Dormouse's story</title>]

 soup.select("p > a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("body > a")
 # []
Find tags by CSS class::

 soup.select(".sister")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("[class~=sister]")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID::

 soup.select("#link1")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select("a#link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute::

 soup.select('a[href]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value::

 soup.select('a[href="http://example.com/elsie"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select('a[href^="http://example.com/"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select('a[href$="tillie"]')
 # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select('a[href*=".com/el"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Match language codes::

 multilingual_markup = """
  <p lang="en">Hello</p>
  <p lang="en-us">Howdy, y'all</p>
  <p lang="en-gb">Pip-pip, old fruit</p>
  <p lang="fr">Bonjour mes amis</p>
 """
 multilingual_soup = BeautifulSoup(multilingual_markup)
 multilingual_soup.select('p[lang|=en]')
 # [<p lang="en">Hello</p>,
 #  <p lang="en-us">Howdy, y'all</p>,
 #  <p lang="en-gb">Pip-pip, old fruit</p>]
This is a convenience for users who know the CSS selector syntax. You
can do all this stuff with the Beautiful Soup API. And if CSS
selectors are all you need, you might as well use lxml directly,
because it’s faster. But this lets you `combine` simple CSS selectors
with the Beautiful Soup API.
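
For instance, you might use a selector to land on a tag and then
continue with the regular search API. A minimal sketch::

 first_sister = soup.select(".sister")[0]   # CSS selector gets us started
 first_sister.find_next_sibling("a")        # then back to the Soup API
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>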
Modifying the tree
==================
Beautiful Soup’s main strength is in searching the parse tree, but you
can also modify the tree and write your changes as a new HTML or XML
document.
Changing tag names and attributes
———————————
I covered this earlier, in `Attributes`_, but it bears repeating. You
can rename a tag, change the values of its attributes, add new
attributes, and delete attributes::

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
 tag = soup.b

 tag.name = "blockquote"
 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>

 del tag['class']
 del tag['id']
 tag
 # <blockquote>Extremely bold</blockquote>
Modifying ``.string``
---------------------

If you set a tag's ``.string`` attribute, the tag's contents are
replaced with the string you give::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)

 tag = soup.a
 tag.string = "New link text."
 tag
 # <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their
contents will be destroyed.
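
Continuing the example, a quick check shows the <i> tag really is gone
(a minimal sketch)::

 print(soup.find("i"))
 # None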
``append()``
------------

You can add to a tag's contents with ``Tag.append()``. It works just
like calling ``.append()`` on a Python list::

 soup = BeautifulSoup("<a>Foo</a>")
 soup.a.append("Bar")

 soup
 # <html><head></head><body><a>FooBar</a></body></html>
 soup.a.contents
 # [u'Foo', u'Bar']
``BeautifulSoup.new_string()`` and ``.new_tag()``
-------------------------------------------------

If you need to add a string to a document, no problem--you can pass a
Python string in to ``append()``, or you can call the factory method
``BeautifulSoup.new_string()``::

 soup = BeautifulSoup("<b></b>")
 tag = soup.b
 tag.append("Hello")
 new_string = soup.new_string(" there")
 tag.append(new_string)
 tag
 # <b>Hello there</b>
 tag.contents
 # [u'Hello', u' there']
What if you need to create a whole new tag? The best solution is to
call the factory method ``BeautifulSoup.new_tag()``::

 soup = BeautifulSoup("<b></b>")
 original_tag = soup.b

 new_tag = soup.new_tag("a", href="http://www.example.com")
 original_tag.append(new_tag)
 original_tag
 # <b><a href="http://www.example.com"></a></b>

 new_tag.string = "Link text."
 original_tag
 # <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.
``insert()``
------------

``Tag.insert()`` is just like ``Tag.append()``, except the new element
doesn't necessarily go at the end of its parent's
``.contents``. It'll be inserted at whatever numeric position you
say. It works just like ``.insert()`` on a Python list::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 tag = soup.a

 tag.insert(1, "but did not endorse ")
 tag
 # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
 tag.contents
 # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
``insert_before()`` and ``insert_after()``
------------------------------------------

The ``insert_before()`` method inserts a tag or string immediately
before something else in the parse tree::

 soup = BeautifulSoup("<b>stop</b>")
 tag = soup.new_tag("i")
 tag.string = "Don't"
 soup.b.string.insert_before(tag)
 soup.b
 # <b><i>Don't</i>stop</b>

The ``insert_after()`` method inserts a tag or string so that it
immediately follows something else in the parse tree::

 soup.b.i.insert_after(soup.new_string(" ever "))
 soup.b
 # <b><i>Don't</i> ever stop</b>
 soup.b.contents
 # [<i>Don't</i>, u' ever ', u'stop']
``clear()``
-----------

``Tag.clear()`` removes the contents of a tag::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 tag = soup.a

 tag.clear()
 tag
 # <a href="http://example.com/"></a>
``extract()``
-------------

``PageElement.extract()`` removes a tag or string from the tree. It
returns the tag or string that was extracted::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 a_tag = soup.a

 i_tag = soup.i.extract()

 a_tag
 # <a href="http://example.com/">I linked to</a>

 i_tag
 # <i>example.com</i>

 print(i_tag.parent)
 # None

At this point you effectively have two parse trees: one rooted at the
``BeautifulSoup`` object you used to parse the document, and one rooted
at the tag that was extracted. You can go on to call ``extract`` on
a child of the element you extracted::

 my_string = i_tag.string.extract()
 my_string
 # u'example.com'

 print(my_string.parent)
 # None
 i_tag
 # <i></i>
``decompose()``
---------------

``Tag.decompose()`` removes a tag from the tree, then `completely
destroys it and its contents`::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 a_tag = soup.a

 soup.i.decompose()

 a_tag
 # <a href="http://example.com/">I linked to</a>
.. _replace_with:
``replace_with()``
------------------

``PageElement.replace_with()`` removes a tag or string from the tree,
and replaces it with the tag or string of your choice::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 a_tag = soup.a

 new_tag = soup.new_tag("b")
 new_tag.string = "example.net"
 a_tag.i.replace_with(new_tag)

 a_tag
 # <a href="http://example.com/">I linked to <b>example.net</b></a>

``replace_with()`` returns the tag or string that was replaced, so
that you can examine it or add it back to another part of the tree.
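
A sketch of putting that return value to use, rerunning the example
above and capturing what ``replace_with()`` hands back::

 soup = BeautifulSoup(markup)
 a_tag = soup.a
 new_tag = soup.new_tag("b")
 new_tag.string = "example.net"

 old_tag = a_tag.i.replace_with(new_tag)  # the old <i> tag comes back
 old_tag
 # <i>example.com</i>

 a_tag.append(old_tag)                    # re-attach it somewhere else
 a_tag
 # <a href="http://example.com/">I linked to <b>example.net</b><i>example.com</i></a>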
``wrap()``
----------

``PageElement.wrap()`` wraps an element in the tag you specify. It
returns the new wrapper::

 soup = BeautifulSoup("<p>I wish I was bold.</p>")
 soup.p.string.wrap(soup.new_tag("b"))
 # <b>I wish I was bold.</b>

 soup.p.wrap(soup.new_tag("div"))
 # <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.
``unwrap()``
------------

``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with
whatever's inside that tag. It's good for stripping out markup::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 a_tag = soup.a

 a_tag.i.unwrap()
 a_tag
 # <a href="http://example.com/">I linked to example.com</a>

Like ``replace_with()``, ``unwrap()`` returns the tag
that was replaced.

(In earlier versions of Beautiful Soup, ``unwrap()`` was called
``replace_with_children()``, and that name will still work.)
Output
======
Pretty-printing
—————
The ``prettify()`` method will turn a Beautiful Soup parse tree into a
nicely formatted bytestring, with each HTML/XML tag on its own line::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 soup.prettify()
 # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

 print(soup.prettify())
 # <html>
 #  <head>
 #  </head>
 #  <body>
 #   <a href="http://example.com/">
 #    I linked to
 #    <i>
 #     example.com
 #    </i>
 #   </a>
 #  </body>
 # </html>

You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
or on any of its ``Tag`` objects::

 print(soup.a.prettify())
 # <a href="http://example.com/">
 #  I linked to
 #  <i>
 #   example.com
 #  </i>
 # </a>
Non-pretty printing
——————-
If you just want a string, with no fancy formatting, you can call
``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
within it::

 str(soup)
 # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

 unicode(soup.a)
 # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

The ``str()`` function returns a string encoded in UTF-8. See
`Encodings`_ for other options.

You can also call ``encode()`` to get a bytestring, and ``decode()``
to get Unicode.
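
A minimal sketch of the difference, reusing the soup from above::

 soup.a.encode("utf-8")   # a bytestring
 # '<a href="http://example.com/">I linked to <i>example.com</i></a>'

 soup.a.decode()          # a Unicode string
 # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'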
.. _output_formatters:
Output formatters
—————–
If you give Beautiful Soup a document that contains HTML entities like
"&ldquo;", they'll be converted to Unicode characters::

 soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
 unicode(soup)
 # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a string, the Unicode characters
will be encoded as UTF-8. You won't get the HTML entities back::

 str(soup)
 # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'

By default, the only characters that are escaped upon output are bare
ampersands and angle brackets. These get turned into "&amp;", "&lt;",
and "&gt;", so that Beautiful Soup doesn't inadvertently generate
invalid HTML or XML::

 soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
 soup.p
 # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

 soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
 soup.a
 # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
You can change this behavior by providing a value for the
``formatter`` argument to ``prettify()``, ``encode()``, or
``decode()``. Beautiful Soup recognizes four possible values for
``formatter``.

The default is ``formatter="minimal"``. Strings will only be processed
enough to ensure that Beautiful Soup generates valid HTML/XML::

 french = "<p>Il a dit <<Sacr&eacute; bleu!>></p>"
 soup = BeautifulSoup(french)
 print(soup.prettify(formatter="minimal"))
 # <html>
 #  <body>
 #   <p>
 #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
 #   </p>
 #  </body>
 # </html>
If you pass in ``formatter="html"``, Beautiful Soup will convert
Unicode characters to HTML entities whenever possible::

 print(soup.prettify(formatter="html"))
 # <html>
 #  <body>
 #   <p>
 #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
 #   </p>
 #  </body>
 # </html>
If you pass in ``formatter=None``, Beautiful Soup will not modify
strings at all on output. This is the fastest option, but it may lead
to Beautiful Soup generating invalid HTML/XML, as in these examples::

 print(soup.prettify(formatter=None))
 # <html>
 #  <body>
 #   <p>
 #    Il a dit <<Sacré bleu!>>
 #   </p>
 #  </body>
 # </html>

 link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
 print(link_soup.a.encode(formatter=None))
 # <a href="http://example.com/?foo=val1&bar=val2">A link</a>
Finally, if you pass in a function for ``formatter``, Beautiful Soup
will call that function once for every string and attribute value in
the document. You can do whatever you want in this function. Here's a
formatter that converts strings to uppercase and does absolutely
nothing else::

 def uppercase(s):
     return s.upper()

 print(soup.prettify(formatter=uppercase))
 # <html>
 #  <body>
 #   <p>
 #    IL A DIT <<SACRÉ BLEU!>>
 #   </p>
 #  </body>
 # </html>

 print(link_soup.a.prettify(formatter=uppercase))
 # <a href="http://example.com/?foo=val1&bar=val2">
 #  A LINK
 # </a>
If you're writing your own function, you should know about the
``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
implements Beautiful Soup's standard formatters as class methods: the
"html" formatter is ``EntitySubstitution.substitute_html``, and the
"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
use these functions to simulate ``formatter="html"`` or
``formatter="minimal"``, but then do something extra.

Here's an example that replaces Unicode characters with HTML entities
whenever possible, but `also` converts all strings to uppercase::

 from bs4.dammit import EntitySubstitution
 def uppercase_and_substitute_html_entities(s):
     return EntitySubstitution.substitute_html(s.upper())

 print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
 # <html>
 #  <body>
 #   <p>
 #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
 #   </p>
 #  </body>
 # </html>
One last caveat: if you create a ``CData`` object, the text inside
that object is always presented `exactly as it appears, with no
formatting`. Beautiful Soup will call the formatter method, just in
case you've written a custom method that counts all the strings in the
document or something, but it will ignore the return value::

 from bs4.element import CData
 soup = BeautifulSoup("<a></a>")
 soup.a.string = CData("one < three")
 print(soup.a.prettify(formatter="xml"))
 # <a>
 #  <![CDATA[one < three]]>
 # </a>
``get_text()``
--------------
If you only want the text part of a document or tag, you can use the
``get_text()`` method. It returns all the text in a document or
beneath a tag, as a single Unicode string::
 markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
 soup = BeautifulSoup(markup)

 soup.get_text()
 # u'\nI linked to example.com\n'
 soup.i.get_text()
 # u'example.com'

You can specify a string to be used to join the bits of text
together::

 soup.get_text("|")
 # u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and
end of each bit of text::

 soup.get_text("|", strip=True)
 # u'I linked to|example.com'

But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
generator instead, and process the text yourself::

 [text for text in soup.stripped_strings]
 # [u'I linked to', u'example.com']
Specifying the parser to use
============================
If you just need to parse some HTML, you can dump the markup into the
``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
Soup will pick a parser for you and parse the data. But there are a
few additional arguments you can pass in to the constructor to change
which parser is used.
The first argument to the ``BeautifulSoup`` constructor is a string or
an open filehandle--the markup you want parsed. The second argument is
`how` you'd like the markup parsed.
If you don't specify anything, you'll get the best HTML parser that's
installed. Beautiful Soup ranks lxml's parser as being the best, then
html5lib's, then Python's built-in parser. You can override this by
specifying one of the following:
* What type of markup you want to parse. Currently supported are
"html", "xml", and "html5".
* The name of the parser library you want to use. Currently supported
options are "lxml", "html5lib", and "html.parser" (Python's
built-in HTML parser).
The section `Installing a parser`_ contrasts the supported parsers.
If you don't have an appropriate parser installed, Beautiful Soup will
ignore your request and pick a different parser. Right now, the only
supported XML parser is lxml. If you don't have lxml installed, asking
for an XML parser won't give you one, and asking for "lxml" won't work
either.
Differences between parsers
---------------------------
Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different. Different parsers will create
different parse trees from the same document. The biggest differences
are between the HTML parsers and the XML parsers. Here's a short
document, parsed as HTML::
BeautifulSoup("")
#
Since an empty tag is not valid HTML, the parser turns it into a
tag pair.
Here's the same document parsed as XML (running this requires that you
have lxml installed). Note that the empty tag is left alone, and
that the document is given an XML declaration instead of being put
into an tag.::
BeautifulSoup("", "xml")
#
#
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won't
matter. One parser will be faster than another, but they'll all give
you a data structure that looks exactly like the original HTML
document.
But if the document is not perfectly-formed, different parsers will
give different results. Here's a short, invalid document parsed using
lxml's HTML parser. Note that the dangling tag is simply
ignored::
BeautifulSoup("", "lxml")
#
Here's the same document parsed using html5lib::
BeautifulSoup("", "html5lib")
#
Instead of ignoring the dangling tag, html5lib pairs it with an
opening tag. This parser also adds an empty
tag to the
document.
Here's the same document parsed with Python's built-in HTML
parser::
BeautifulSoup("", "html.parser")
#
Like html5lib, this parser ignores the closing tag. Unlike
html5lib, this parser makes no attempt to create a well-formed HTML
document by adding a tag. Unlike lxml, it doesn't even bother
to add an tag.
Since the document "" is invalid, none of these techniques is
the "correct" way to handle it. The html5lib parser uses techniques
that are part of the HTML5 standard, so it has the best claim on being
the "correct" way, but all three techniques are legitimate.
Differences between parsers can affect your script. If you're planning
on distributing your script to other people, or running it on multiple
machines, you should specify a parser in the ``BeautifulSoup``
constructor. That will reduce the chances that your users parse a
document differently from the way you parse it.
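
A minimal sketch of pinning the parser (``markup`` here stands for
whatever document you're parsing; any parser name from `Installing a
parser`_ will do)::

 from bs4 import BeautifulSoup

 # Everyone who runs this gets the same parse tree,
 # as long as lxml is installed.
 soup = BeautifulSoup(markup, "lxml")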
Encodings
=========
Any HTML or XML document is written in a specific encoding like ASCII
or UTF-8. But when you load that document into Beautiful Soup, you'll
discover it's been converted to Unicode::
markup = "
Sacr\xc3\xa9 bleu!
"
soup = BeautifulSoup(markup)
soup.h1
# Sacré bleu!
soup.h1.string
# u'Sacr\xe9 bleu!'
It's not magic. (That sure would be nice.) Beautiful Soup uses a
sub-library called `Unicode, Dammit`_ to detect a document's encoding
and convert it to Unicode. The autodetected encoding is available as
the ``.original_encoding`` attribute of the ``BeautifulSoup`` object::
soup.original_encoding
'utf-8'
Unicode, Dammit guesses correctly most of the time, but sometimes it
makes mistakes. Sometimes it guesses correctly, but only after a
byte-by-byte search of the document that takes a very long time. If
you happen to know a document's encoding ahead of time, you can avoid
mistakes and delays by passing it to the ``BeautifulSoup`` constructor
as ``from_encoding``.
Here's a document written in ISO-8859-8. The document is so short that
Unicode, Dammit can't get a good lock on it, and misidentifies it as
ISO-8859-7::

 markup = b"<h1>\xed\xe5\xec\xf9</h1>"
 soup = BeautifulSoup(markup)
 soup.h1
 # <h1>νεμω</h1>
 soup.original_encoding
 # 'ISO-8859-7'

We can fix this by passing in the correct ``from_encoding``::

 soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
 soup.h1
 # <h1>םולש</h1>
 soup.original_encoding
 # 'iso8859-8'
In rare cases (usually when a UTF-8 document contains text written in
a completely different encoding), the only way to get Unicode may be
to replace some characters with the special Unicode character
"REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do
this, it will set the ``.contains_replacement_characters`` attribute
to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This
lets you know that the Unicode representation is not an exact
representation of the original--some data was lost. If a document
contains �, but ``.contains_replacement_characters`` is ``False``,
you'll know that the � was there originally (as it is in this
paragraph) and doesn't stand in for missing data.
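
A sketch of that check (``mystery_bytes`` here is hypothetical input
in an unknown encoding)::

 from bs4 import UnicodeDammit

 dammit = UnicodeDammit(mystery_bytes)
 if dammit.contains_replacement_characters:
     # Some bytes could not be decoded; U+FFFD stands in for them.
     print("Data was lost during conversion to Unicode.")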
Output encoding
---------------
When you write out a document from Beautiful Soup, you get a UTF-8
document, even if the document wasn't in UTF-8 to begin with. Here's a
document written in the Latin-1 encoding::
 markup = b'''
  <html>
   <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
   </head>
   <body>
    <p>Sacr\xe9 bleu!</p>
   </body>
  </html>
 '''

 soup = BeautifulSoup(markup)
 print(soup.prettify())
 # <html>
 #  <head>
 #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
 #  </head>
 #  <body>
 #   <p>
 #    Sacré bleu!
 #   </p>
 #  </body>
 # </html>

Note that the <meta> tag has been rewritten to reflect the fact that
the document is now in UTF-8.
If you don't want UTF-8, you can pass an encoding into ``prettify()``::

 print(soup.prettify("latin-1"))
 # <html>
 #  <head>
 #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
 # ...

You can also call encode() on the ``BeautifulSoup`` object, or any
element in the soup, just as if it were a Python string::

 soup.p.encode("latin-1")
 # '<p>Sacr\xe9 bleu!</p>'

 soup.p.encode("utf-8")
 # '<p>Sacr\xc3\xa9 bleu!</p>'
Any characters that can't be represented in your chosen encoding will
be converted into numeric XML entity references. Here's a document
that includes the Unicode character SNOWMAN::

 markup = u"<b>\N{SNOWMAN}</b>"
 snowman_soup = BeautifulSoup(markup)
 tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like
☃), but there's no representation for that character in ISO-Latin-1 or
ASCII, so it's converted into "&#9731;" for those encodings::

 print(tag.encode("utf-8"))
 # <b>☃</b>

 print(tag.encode("latin-1"))
 # <b>&#9731;</b>

 print(tag.encode("ascii"))
 # <b>&#9731;</b>
Unicode, Dammit
---------------
You can use Unicode, Dammit without using Beautiful Soup. It's useful
whenever you have data in an unknown encoding and you just want it to
become Unicode::
from bs4 import UnicodeDammit
dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'
The more data you give Unicode, Dammit, the more accurately it will
guess. If you have your own suspicions as to what the encoding might
be, you can pass them in as a list::
dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'latin-1'
Unicode, Dammit has two special features that Beautiful Soup doesn't
use.
Smart quotes
^^^^^^^^^^^^
You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML
entities::
markup = b"I just \x93love\x94 Microsoft Word\x92s smart quotes
"
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
# u'I just “love” Microsoft Word’s smart quotes
'
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
# u'I just “love” Microsoft Word’s smart quotes
'
You can also convert Microsoft smart quotes to ASCII quotes::
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# u'I just "love" Microsoft Word\'s smart quotes
'
Hopefully you'll find this feature useful, but Beautiful Soup doesn't
use it. Beautiful Soup prefers the default behavior, which is to
convert Microsoft smart quotes to Unicode characters along with
everything else::
UnicodeDammit(markup, ["windows-1252"]).unicode_markup
# u'I just \u201clove\u201d Microsoft Word\u2019s smart quotes
'
Inconsistent encodings
^^^^^^^^^^^^^^^^^^^^^^
Sometimes a document is mostly in UTF-8, but contains Windows-1252
characters such as (again) Microsoft smart quotes. This can happen
when a website includes data from multiple sources. You can use
``UnicodeDammit.detwingle()`` to turn such a document into pure
UTF-8. Here's a simple example::
snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")
This document is a mess. The snowmen are in UTF-8 and the quotes are
in Windows-1252. You can display the snowmen or the quotes, but not
both::
print(doc)
# ☃☃☃�I like snowmen!�
print(doc.decode("windows-1252"))
# ☃☃☃“I like snowmen!”
Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and
decoding it as Windows-1252 gives you gibberish. Fortunately,
``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8,
allowing you to decode it to Unicode and display the snowmen and quote
marks simultaneously::
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”
``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252
embedded in UTF-8 (or vice versa, I suppose), but this is the most
common case.
Note that you must know to call ``UnicodeDammit.detwingle()`` on your
data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit``
constructor. Beautiful Soup assumes that a document has a single
encoding, whatever it might be. If you pass it a document that
contains both UTF-8 and Windows-1252, it's likely to think the whole
document is Windows-1252, and the document will come out looking like
``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``.
``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.
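
So the safe order of operations is: detwingle first, parse second. A
sketch, reusing ``doc`` from above::

 from bs4 import BeautifulSoup, UnicodeDammit

 clean = UnicodeDammit.detwingle(doc)  # now pure UTF-8
 soup = BeautifulSoup(clean)
 print(soup.get_text())
 # ☃☃☃“I like snowmen!”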
Parsing only part of a document
===============================
Let's say you want to use Beautiful Soup to look at a document's <a>
tags. It's a waste of time and memory to parse the entire document and
then go over it again looking for <a> tags. It would be much faster to
ignore everything that wasn't an <a> tag in the first place. The
``SoupStrainer`` class allows you to choose which parts of an incoming
document are parsed. You just create a ``SoupStrainer`` and pass it in
to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.
(Note that *this feature won't work if you're using the html5lib
parser*. If you use html5lib, the whole document will be parsed, no
matter what. This is because html5lib constantly rearranges the parse
tree as it works, and if some part of the document didn't actually
make it into the parse tree, it'll crash. To avoid confusion, in the
examples below I'll be forcing Beautiful Soup to use Python's
built-in parser.)
``SoupStrainer``
----------------
The ``SoupStrainer`` class takes the same arguments as a typical
method from `Searching the tree`_: :ref:`name `, :ref:`attrs
`, :ref:`text `, and :ref:`**kwargs `. Here are
three ``SoupStrainer`` objects::
 from bs4 import SoupStrainer

 only_a_tags = SoupStrainer("a")

 only_tags_with_id_link2 = SoupStrainer(id="link2")

 def is_short_string(string):
     return len(string) < 10

 only_short_strings = SoupStrainer(text=is_short_string)
I'm going to bring back the "three sisters" document one more time,
and we'll see what the document looks like when it's parsed with these
three ``SoupStrainer`` objects::
html_doc = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""
 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
 # <a class="sister" href="http://example.com/elsie" id="link1">
 #  Elsie
 # </a>
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>
 # <a class="sister" href="http://example.com/tillie" id="link3">
 #  Tillie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
 # Elsie
 # ,
 # Lacie
 # and
 # Tillie
 # ...
 #
You can also pass a ``SoupStrainer`` into any of the methods covered
in `Searching the tree`_. This probably isn't terribly useful, but I
thought I'd mention it::
soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
# u'\n\n', u'...', u'\n']
Troubleshooting
===============
Version mismatch problems
-------------------------
* ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME =
u'[document]'``): Caused by running the Python 2 version of
Beautiful Soup under Python 3, without converting the code.
* ``ImportError: No module named HTMLParser`` - Caused by running the
Python 2 version of Beautiful Soup under Python 3.
* ``ImportError: No module named html.parser`` - Caused by running the
Python 3 version of Beautiful Soup under Python 2.
* ``ImportError: No module named BeautifulSoup`` - Caused by running
Beautiful Soup 3 code on a system that doesn't have BS3
installed. Or, by writing Beautiful Soup 4 code without knowing that
the package name has changed to ``bs4``.
* ``ImportError: No module named bs4`` - Caused by running Beautiful
Soup 4 code on a system that doesn't have BS4 installed.
Parsing XML
-----------
By default, Beautiful Soup parses documents as HTML. To parse a
document as XML, pass in "xml" as the second argument to the
``BeautifulSoup`` constructor::
soup = BeautifulSoup(markup, "xml")
You'll need to :ref:`have lxml installed `.
Other parser problems
---------------------
* If your script works on one computer but not another, it's probably
because the two computers have different parser libraries
available. For example, you may have developed the script on a
computer that has lxml installed, and then tried to run it on a
computer that only has html5lib installed. See `Differences between
parsers`_ for why this matters, and fix the problem by mentioning a
specific parser library in the ``BeautifulSoup`` constructor.
* ``HTMLParser.HTMLParseError: malformed start tag`` or
``HTMLParser.HTMLParseError: bad end tag`` - Caused by
giving Python's built-in HTML parser a document it can't handle. Any
other ``HTMLParseError`` is probably the same problem. Solution:
:ref:`Install lxml or html5lib. `
* If you can't find a tag that you know is in the document (that is,
``find_all()`` returned ``[]`` or ``find()`` returned ``None``),
you're probably using Python's built-in HTML parser, which sometimes
skips tags it doesn't understand. Solution: :ref:`Install lxml or
html5lib. `
Miscellaneous
-------------
* ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the
  tag in question doesn't define the ``attr`` attribute. The most
  common errors are ``KeyError: 'href'`` and ``KeyError:
  'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is
  defined, just as you would with a Python dictionary (see the sketch
  after this list).
* ``UnicodeEncodeError: 'charmap' codec can't encode character
u'\xfoo' in position bar`` (or just about any other
``UnicodeEncodeError``) - This is not a problem with Beautiful Soup:
you're trying to print a Unicode character that your console doesn't
know how to display. See `this page on the Python wiki
`_ for help. One easy
solution is to write the text to a file and then look at the file.
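
A sketch of the safe attribute-lookup pattern mentioned above::

 for link in soup.find_all("a"):
     href = link.get("href")  # None instead of KeyError when missing
     if href is not None:
         print(href)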
Improving Performance
---------------------
Beautiful Soup will never be as fast as the parsers it sits on top
of. If response time is critical, if you're paying for computer time
by the hour, or if there's any other reason why computer time is more
valuable than programmer time, you should forget about Beautiful Soup
and work directly atop `lxml `_.
That said, there are things you can do to speed up Beautiful Soup. If
you're not using lxml as the underlying parser, my advice is to
:ref:`start `. Beautiful Soup parses documents
significantly faster using lxml than using html.parser or html5lib.
Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
doing a byte-by-byte examination of the file. This slows Beautiful
Soup to a crawl. My tests indicate that this only happened on 2.x
versions of Python, and that it happened most often with documents
using Russian or Chinese encodings. If this is happening to you, you
can fix it by using Python 3 for your script. Or, if you happen to
know a document's encoding, you can pass it into the
``BeautifulSoup`` constructor as ``from_encoding``.
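A minimal sketch of passing in a known encoding (the filename and the
encoding here are assumptions made for the example)::

    from bs4 import BeautifulSoup

    with open("russian_document.html", "rb") as f:
        markup = f.read()
    # Skips the slow byte-by-byte encoding detection entirely.
    soup = BeautifulSoup(markup, from_encoding="koi8-r")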
`Parsing only part of a document`_ won't save you much time parsing
the document, but it can save a lot of memory, and it'll make
`searching` the document much faster.
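A minimal sketch combining the two ideas, assuming lxml is installed
(the markup is made up)::

    from bs4 import BeautifulSoup, SoupStrainer

    markup = "<p>intro</p><a href='/one'>1</a><a href='/two'>2</a>"
    # Only <a> tags are turned into objects; everything else is ignored.
    only_a_tags = SoupStrainer("a")
    soup = BeautifulSoup(markup, "lxml", parse_only=only_a_tags)
    print(soup.find_all("a"))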
Beautiful Soup 3
================
Beautiful Soup 3 is the previous release series, and is no longer
being actively developed. It's currently packaged with all major Linux
distributions:
:kbd:`$ apt-get install python-beautifulsoup`
It's also published through PyPI as ``BeautifulSoup``:
:kbd:`$ easy_install BeautifulSoup`
:kbd:`$ pip install BeautifulSoup`
You can also `download a tarball of Beautiful Soup 3.2.0
<http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_.
If you ran ``easy_install beautifulsoup`` or ``easy_install
BeautifulSoup``, but your code doesn't work, you installed Beautiful
Soup 3 by mistake. You need to run ``easy_install beautifulsoup4``.
`The documentation for Beautiful Soup 3 is archived online
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If
your first language is Chinese, it might be easier for you to read
`the Chinese translation of the Beautiful Soup 3 documentation
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html>`_,
then read this document to find out about the changes made in
Beautiful Soup 4.
Porting code to BS4
-------------------
Most code written against Beautiful Soup 3 will work against Beautiful
Soup 4 with one simple change. All you should have to do is change the
package name from ``BeautifulSoup`` to ``bs4``. So this::

    from BeautifulSoup import BeautifulSoup

becomes this::

    from bs4 import BeautifulSoup
* If you get the ``ImportError`` "No module named BeautifulSoup", your
problem is that you're trying to run Beautiful Soup 3 code, but you
only have Beautiful Soup 4 installed.
* If you get the ``ImportError`` "No module named bs4", your problem
is that you're trying to run Beautiful Soup 4 code, but you only
have Beautiful Soup 3 installed.
Although BS4 is mostly backwards-compatible with BS3, most of its
methods have been deprecated and given new names for `PEP 8 compliance
<http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other
renames and changes, and a few of them break backwards compatibility.
Here's what you'll need to know to convert your BS3 code and habits to BS4:
You need a parser
^^^^^^^^^^^^^^^^^
Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
deprecated and removed in Python 3.0. Beautiful Soup 4 uses
``html.parser`` by default, but you can plug in lxml or html5lib and
use that instead. See `Installing a parser`_ for a comparison.
Since ``html.parser`` is not the same parser as ``SGMLParser``, it
will treat invalid markup differently. Usually the "difference" is
that ``html.parser`` crashes. In that case, you'll need to install
another parser. But sometimes ``html.parser`` just creates a different
parse tree than ``SGMLParser`` would. If this happens, you may need to
update your BS3 scraping code to deal with the new tree.
Method names
^^^^^^^^^^^^
* ``renderContents`` -> ``encode_contents``
* ``replaceWith`` -> ``replace_with``
* ``replaceWithChildren`` -> ``unwrap``
* ``findAll`` -> ``find_all``
* ``findAllNext`` -> ``find_all_next``
* ``findAllPrevious`` -> ``find_all_previous``
* ``findNext`` -> ``find_next``
* ``findNextSibling`` -> ``find_next_sibling``
* ``findNextSiblings`` -> ``find_next_siblings``
* ``findParent`` -> ``find_parent``
* ``findParents`` -> ``find_parents``
* ``findPrevious`` -> ``find_previous``
* ``findPreviousSibling`` -> ``find_previous_sibling``
* ``findPreviousSiblings`` -> ``find_previous_siblings``
* ``nextSibling`` -> ``next_sibling``
* ``previousSibling`` -> ``previous_sibling``
Some arguments to the Beautiful Soup constructor were renamed for the
same reasons:
* ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``
* ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``
I renamed one method for compatibility with Python 3:
* ``Tag.has_key()`` -> ``Tag.has_attr()``
I renamed one attribute to use more accurate terminology:
* ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``
I renamed three attributes to avoid using words that have special
meaning to Python. Unlike the others, these changes are *not backwards
compatible.* If you used these attributes in BS3, your code will break
on BS4 until you change them.
* ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup``
* ``Tag.next`` -> ``Tag.next_element``
* ``Tag.previous`` -> ``Tag.previous_element``
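A quick before-and-after sketch of these renames (the markup is made
up; the old camelCase spellings still work in BS4, apart from the
three just listed as not backwards compatible)::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><a class='x'>link</a></p>", "html.parser")
    soup.findAll("a")         # BS3 spelling, still works
    soup.find_all("a")        # BS4 spelling, preferred
    soup.a.has_attr("class")  # replaces Tag.has_key()
    soup.a.next_element       # replaces Tag.next (old name is gone)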
Generators
^^^^^^^^^^
I gave the generators PEP 8-compliant names, and transformed them into
properties:
* ``childGenerator()`` -> ``children``
* ``nextGenerator()`` -> ``next_elements``
* ``nextSiblingGenerator()`` -> ``next_siblings``
* ``previousGenerator()`` -> ``previous_elements``
* ``previousSiblingGenerator()`` -> ``previous_siblings``
* ``recursiveChildGenerator()`` -> ``descendants``
* ``parentGenerator()`` -> ``parents``
So instead of this::

    for parent in tag.parentGenerator():
        ...

You can write this::

    for parent in tag.parents:
        ...
(But the old code will still work.)
Some of the generators used to yield ``None`` after they were done, and
then stop. That was a bug. Now the generators just stop.
There are two new generators, :ref:`.strings and
.stripped_strings <string-generators>`. ``.strings`` yields
NavigableString objects, and ``.stripped_strings`` yields Python
strings that have had whitespace stripped.
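A minimal sketch of the two new generators (the markup is made up)::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p> One </p><p> Two </p>", "html.parser")
    print(list(soup.strings))           # [' One ', ' Two ']
    print(list(soup.stripped_strings))  # ['One', 'Two']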
XML
^^^
There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To
parse XML you pass in "xml" as the second argument to the
``BeautifulSoup`` constructor. For the same reason, the
``BeautifulSoup`` constructor no longer recognizes the ``isHTML``
argument.
Beautiful Soup's handling of empty-element XML tags has been
improved. Previously when you parsed XML you had to explicitly say
which tags were considered empty-element tags. The ``selfClosingTags``
argument to the constructor is no longer recognized. Instead,
Beautiful Soup considers any empty tag to be an empty-element tag. If
you add a child to an empty-element tag, it stops being an
empty-element tag.
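A minimal sketch of the empty-element behavior, assuming lxml is
installed (it's required for the "xml" parser)::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<root><thing/></root>", "xml")
    print(soup.thing)   # <thing/>, since any empty tag is empty-element
    soup.thing.append("now with content")
    print(soup.thing)   # <thing>now with content</thing>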
Entities
^^^^^^^^
An incoming HTML or XML entity is always converted into the
corresponding Unicode character. Beautiful Soup 3 had a number of
overlapping ways of dealing with entities, which have been
removed. The ``BeautifulSoup`` constructor no longer recognizes the
``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode,
Dammit`_ still has ``smart_quotes_to``, but its default is now to turn
smart quotes into Unicode.)
If you want to turn those Unicode characters back into HTML entities
on output, rather than turning them into UTF-8 characters, you need to
use an :ref:`output formatter <output_formatters>`.
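A minimal sketch of round-tripping an entity through an output
formatter (the markup is made up)::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p>Caf&eacute;</p>", "html.parser")
    print(soup.p)                           # <p>Café</p>, entity now Unicode
    print(soup.p.encode(formatter="html"))  # entity restored on output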
Miscellaneous
^^^^^^^^^^^^^
:ref:`Tag.string <.string>` now operates recursively. If tag A
contains a single tag B and nothing else, then A.string is the same as
B.string. (Previously, it was None.)
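A minimal sketch (the markup is made up)::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<a><b>text</b></a>", "html.parser")
    print(soup.a.string)  # text; under BS3 this was None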
`Multi-valued attributes`_ like ``class`` have lists of strings as
their values, not strings. This may affect the way you search by CSS
class.
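A minimal sketch (the markup is made up; passing a bare string as the
second argument to ``find_all()`` searches CSS classes)::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p class='strikeout body'></p>", "html.parser")
    print(soup.p['class'])                  # ['strikeout', 'body'], a list
    print(soup.find_all("p", "strikeout"))  # matches either class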
If you pass one of the ``find*`` methods both :ref:`text <text>` `and`
a tag-specific argument like :ref:`name <name>`, Beautiful Soup will
search for tags that match your tag-specific criteria and whose
:ref:`Tag.string <.string>` matches your value for :ref:`text
<text>`. It will `not` find the strings themselves. Previously,
Beautiful Soup ignored the tag-specific arguments and looked for
strings.
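A minimal sketch of the combined search (the markup is made up)::

    from bs4 import BeautifulSoup

    markup = "<a>Click here</a><a>Elsewhere</a>Click here"
    soup = BeautifulSoup(markup, "html.parser")
    # Finds the <a> tag whose .string is "Click here"; the bare
    # "Click here" string at the end is not returned.
    print(soup.find_all("a", text="Click here"))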
The ``BeautifulSoup`` constructor no longer recognizes the
``markupMassage`` argument. It's now the parser's responsibility to
handle markup correctly.
The rarely-used alternate parser classes like
``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been
removed. It's now the parser's decision how to handle ambiguous
markup.
NEWS.txt (from the beautifulsoup4-4.1.0 distribution)
======================================================
= 4.1.0 (20120529) =
* Added experimental support for fixing Windows-1252 characters
embedded in UTF-8 documents. (UnicodeDammit.detwingle())
* Fixed the handling of &quot; with the built-in parser. [bug=993871]
* Comments, processing instructions, document type declarations, and
markup declarations are now treated as preformatted strings, the way
CData blocks are. [bug=1001025]
* Fixed a bug with the lxml treebuilder that prevented the user from
adding attributes to a tag that didn’t originally have
attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.
* Fixed some edge-case bugs having to do with inserting an element
into a tag it’s already inside, and replacing one of a tag’s
children with another. [bug=997529]
* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]
This caused a major refactoring of the search code. All the tests
pass, but it’s possible that some searches will behave differently.
= 4.0.5 (20120427) =
* Added a new method, wrap(), which wraps an element in a tag.
* Renamed replace_with_children() to unwrap(), which is easier to
understand and also the jQuery name of the function.
* Made encoding substitution in tags completely transparent (no
more %SOUP-ENCODING%).
* Fixed a bug in decoding data that contained a byte-order mark, such
as data encoded in UTF-16LE. [bug=988980]
* Fixed a bug that made the HTMLParser treebuilder generate XML
definitions ending with two question marks instead of
one. [bug=984258]
* Upon document generation, CData objects are no longer run through
the formatter. [bug=988905]
* The test suite now passes when lxml is not installed, whether or not
html5lib is installed. [bug=987004]
* Print a warning on HTMLParseErrors to let people know they should
install a better parser library.
= 4.0.4 (20120416) =
* Fixed a bug that sometimes created disconnected trees.
* Fixed a bug with the string setter that moved a string around the
tree instead of copying it. [bug=983050]
* Attribute values are now run through the provided output formatter.
Previously they were always run through the 'minimal' formatter. In
the future I may make it possible to specify different formatters
for attribute values and strings, but for now, consistent behavior
is better than inconsistent behavior. [bug=980237]
* Added the missing renderContents method from Beautiful Soup 3. Also
added an encode_contents() method to go along with decode_contents().
* Give a more useful error when the user tries to run the Python 2
version of BS under Python 3.
* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
UnicodeDammit(markup, smart_quotes_to="ascii").
= 4.0.3 (20120403) =
* Fixed a typo that caused some versions of Python 3 to convert the
Beautiful Soup codebase incorrectly.
* Got rid of the 4.0.2 workaround for HTML documents--it was
unnecessary and the workaround was triggering a (possibly different,
but related) bug in lxml. [bug=972466]
= 4.0.2 (20120326) =
* Worked around a possible bug in lxml that prevents non-tiny XML
documents from being parsed. [bug=963880, bug=963936]
* Fixed a bug where specifying `text` while also searching for a tag
only worked if `text` wanted an exact string match. [bug=955942]
= 4.0.1 (20120314) =
* This is the first official release of Beautiful Soup 4. There is no
4.0.0 release, to eliminate any possibility that packaging software
might treat "4.0.0" as being an earlier version than "4.0.0b10".
* Brought BS up to date with the latest release of soupselect, adding
CSS selector support for direct descendant matches and multiple CSS
class matches.
= 4.0.0b10 (20120302) =
* Added support for simple CSS selectors, taken from the soupselect project.
* Fixed a crash when using html5lib. [bug=943246]
* In HTML5-style <meta> tags, the value of the "charset"
attribute is now replaced with the appropriate encoding on
output. [bug=942714]
* Fixed a bug that caused calling a tag to sometimes call find_all()
with the wrong arguments. [bug=944426]
* For backwards compatibility, brought back the BeautifulStoneSoup
class as a deprecated wrapper around BeautifulSoup.
= 4.0.0b9 (20120228) =
* Fixed the string representation of DOCTYPEs that have both a public
ID and a system ID.
* Fixed the generated XML declaration.
* Renamed Tag.nsprefix to Tag.prefix, for consistency with
NamespacedAttribute.
* Fixed a test failure that occurred on Python 3.x when chardet was
installed.
* Made prettify() return Unicode by default, so it will look nice on
Python 3 when passed into print().
= 4.0.0b8 (20120224) =
* All tree builders now preserve namespace information in the
documents they parse. If you use the html5lib parser or lxml’s XML
parser, you can access the namespace URL for a tag as tag.namespace.
However, there is no special support for namespace-oriented
searching or tree manipulation. When you search the tree, you need
to use namespace prefixes exactly as they’re used in the original
document.
* The string representation of a DOCTYPE always ends in a newline.
* Issue a warning if the user tries to use a SoupStrainer in
conjunction with the html5lib tree builder, which doesn’t support
them.
= 4.0.0b7 (20120223) =
* Upon decoding to string, any characters that can’t be represented in
your chosen encoding will be converted into numeric XML entity
references.
* Issue a warning if characters were replaced with REPLACEMENT
CHARACTER during Unicode conversion.
* Restored compatibility with Python 2.6.
* The install process no longer installs docs or auxiliary text files.
* It’s now possible to deepcopy a BeautifulSoup object created with
Python’s built-in HTML parser.
* About 100 unit tests that "test" the behavior of various parsers on
invalid markup have been removed. Legitimate changes to those
parsers caused these tests to fail, indicating that perhaps
Beautiful Soup should not test the behavior of foreign
libraries.
The problematic unit tests have been reformulated as informational
comparisons generated by the script
scripts/demonstrate_parser_differences.py.
This makes Beautiful Soup compatible with html5lib version 0.95 and
future versions of HTMLParser.
= 4.0.0b6 (20120216) =
* Multi-valued attributes like "class" always have a list of values,
even if there’s only one value in the list.
* Added a number of multi-valued attributes defined in HTML5.
* Stopped generating a space before the slash that closes an
empty-element tag. This may come back if I add a special XHTML mode
(http://www.w3.org/TR/xhtml1/#C_2), but right now it’s pretty
useless.
* Passing text along with tag-specific arguments to a find* method:
find("a", text="Click here")
will find tags that contain the given text as their
.string. Previously, the tag-specific arguments were ignored and
only strings were searched.
* Fixed a bug that caused the html5lib tree builder to build a
partially disconnected tree. Generally cleaned up the html5lib tree
builder.
* If you restrict a multi-valued attribute like "class" to a string
that contains spaces, Beautiful Soup will only consider it a match
if the values correspond to that specific string.
= 4.0.0b5 (20120209) =
* Rationalized Beautiful Soup’s treatment of CSS class. A tag
belonging to multiple CSS classes is treated as having a list of
values for the 'class' attribute. Searching for a CSS class will
match *any* of the CSS classes.
This actually affects all attributes that the HTML standard defines
as taking multiple values (class, rel, rev, archive, accept-charset,
and headers), but 'class' is by far the most common. [bug=41034]
* If you pass anything other than a dictionary as the second argument
to one of the find* methods, it’ll assume you want to use that
object to search against a tag’s CSS classes. Previously this only
worked if you passed in a string.
* Fixed a bug that caused a crash when you passed a dictionary as an
attribute value (possibly because you mistyped "attrs"). [bug=842419]
* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
like <meta charset="utf-8" />. [bug=837268]
* If Unicode, Dammit can’t figure out a consistent encoding for a
page, it will try each of its guesses again, with errors="replace"
instead of errors="strict". This may mean that some data gets
replaced with REPLACEMENT CHARACTER, but at least most of it will
get turned into Unicode. [bug=754903]
* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
on certain kinds of markup. [bug=838800]
* Fixed a bug that wrecked the tree if you replaced an element with an
empty string. [bug=728697]
* Improved Unicode, Dammit’s behavior when you give it Unicode to
begin with.
= 4.0.0b4 (20120208) =
* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()
* BeautifulSoup.new_tag() will follow the rules of whatever
tree-builder was used to create the original BeautifulSoup object. A
new <p> tag will look like "<p/>" if the soup object was created to
parse XML, but it will look like "<p></p>" if the soup object was
created to parse HTML.
* We pass in strict=False to html.parser on Python 3, greatly
improving html.parser’s ability to handle bad HTML.
* We also monkeypatch a serious bug in html.parser that made
strict=False disastrous on Python 3.2.2.
* Replaced the "substitute_html_entities" argument with the
more general "formatter" argument.
* Bare ampersands and angle brackets are always converted to XML
entities unless the user prevents it.
* Added PageElement.insert_before() and PageElement.insert_after(),
which let you put an element into the parse tree with respect to
some other element.
* Raise an exception when the user tries to do something nonsensical
like insert a tag into itself.
= 4.0.0b3 (20120203) =
Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup’s custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.
Beautiful Soup 4.0 comes with glue code for four parsers:
* Python’s standard HTMLParser (html.parser in Python 3)
* lxml’s HTML and XML parsers
* html5lib’s HTML parser
HTMLParser is the default, but I recommend you install lxml if you
can.
For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.
=== The module name has changed ===
Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':
>>> from bs4 import BeautifulSoup
=== It works with Python 3 ===
Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don’t sacrifice
quality.
Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.
=== CDATA sections are normal text, if they're understood at all. ===
Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <script> and <style>:

 <div><![CDATA[foo]]></div> => <div></div>
 <script><![CDATA[foo]]></script> => <script>foo</script>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.
=== Miscellaneous other stuff ===
If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

 <?xml version="1.0" encoding="utf-8"?>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.
= 3.2.0 =
The 3.1 series wasn’t very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.
= 3.1.0 =
A hybrid version that supports 2.4 and can be automatically converted
to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate
behavior changes.
1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.
The effect of this is that you can’t pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you’ll be ready (well, readier) for Python 3.
2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There’s some bad HTML that SGMLParser
handled but HTMLParser doesn’t, usually to do with attribute values
that aren't closed or have brackets inside them:

 <a href="foo</a>, </a><a href="bar">baz</a>
 <a b="<a>">
A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.
3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact:

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.
= 3.0.7a =
Added an import that makes BS work in Python 2.3.
= 3.0.7 =
Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.
Fixed a TypeError that occurred in some circumstances when a tag
contained no text.
Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.
Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.