<?php

namespace WordPress\XML;

use WP_HTML_Span;
use WP_HTML_Text_Replacement;

use function WordPress\Encoding\utf8_codepoint_at;

/**
 * XML API: XMLProcessor class
 *
 * Scans through an XML document to find specific tags, then
 * transforms those tags by adding, removing, or updating the
 * values of the XML attributes within that tag (opener).
 *
 * It implements a subset of the XML 1.0 specification (https://www.w3.org/TR/xml/)
 * and supports XML documents with the following characteristics:
 *
 * * XML 1.0
 * * Well-formed
 * * UTF-8 encoded
 * * Not standalone (so can use external entities)
 * * No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections (will fail on them)
 *
 * ### Possible future direction for this module
 *
 * XMLProcessor only aims to support XML 1.0, which is mostly well-defined and
 * supported across the web. There are no plans to extend the parsing logic to XML 1.1,
 * which is a more complex specification and not so widely supported.
 *
 * @TODO: Include the cursor string in internal bookmarks and use it for seeking.
 *
 * @TODO: Track specific error states, expose informative messages, line
 *        numbers, indexes, and other debugging info.
 *
 * @TODO: Skip over the following syntax elements instead of failing:
 *        * <!DOCTYPE, see https://www.w3.org/TR/xml/#sec-prolog-dtd
 *        * <!ATTLIST, see https://www.w3.org/TR/xml/#attdecls
 *        * <!ENTITY, see https://www.w3.org/TR/xml/#sec-entity-decl
 *        * <!NOTATION, see https://www.w3.org/TR/xml/#sec-entity-decl
 *        * Conditional sections, see https://www.w3.org/TR/xml/#sec-condition-sect
 *
 * @TODO: Support XML 1.1.
 *
 * @TODO: Evaluate the performance of utf8_codepoint_at() against using the mbstring
 *        extension. If mbstring is faster, then use it whenever it's available with
 *        utf8_codepoint_at() as a fallback.
 *
 * @package WordPress
 * @subpackage HTML-API
 * @since WP_VERSION
 */

/**
 * Core class used to modify attributes in an XML document for tags matching a query.
 *
 * ## Usage
 *
 * Use of this class requires three steps:
 *
 *  1. Create a new class instance with your input XML document.
 *  2. Find the tag(s) you are looking for.
 *  3. Request changes to the attributes in those tag(s).
 *
 * Example:
 *
 *     $processor = XMLProcessor::create_from_string( $xml );
 *     if ( $processor->next_tag( 'option' ) ) {
 *         $processor->set_attribute( '', 'selected', 'yes' );
 *     }
 *
 * ### Finding tags
 *
 * The `next_tag()` function moves the internal cursor through
 * your input XML document until it finds a tag meeting any of
 * the supplied restrictions in the optional query argument. If
 * no argument is provided then it will find the next XML tag,
 * regardless of what kind it is.
 *
 * If you want to _find whatever the next tag is_:
 *
 *     $processor->next_tag();
 *
 * | Goal                                           | Query                                                     |
 * |------------------------------------------------|-----------------------------------------------------------|
 * | Find any tag.                                  | `$processor->next_tag();`                                 |
 * | Find next image tag.                           | `$processor->next_tag( array( 'tag_name' => 'image' ) );` |
 * | Find next image tag (shorthand).               | `$processor->next_tag( 'image' );`                        |
 * | Find next image tag in the "wp.org" namespace. | `$processor->next_tag( array( 'wp.org', 'image' ) );`     |
 *
 * #### Namespace Examples
 *
 * To work with namespaces, you can use the `breadcrumbs` query format, where each
 * breadcrumb is a tuple of (namespace URI, local name):
 *
 *     $xml = '<root xmlns:wp="http://wordpress.org/export/1.2/"><wp:image src="cat.jpg" /></root>';
 *     $processor = XMLProcessor::create_from_string( $xml );
 *     // Find the <wp:image> tag
 *     if ( $processor->next_tag( array( 'http://wordpress.org/export/1.2/', 'image' ) ) ) {
 *         // Get the namespace URI of the matched tag
 *         $ns = $processor->get_tag_namespace(); // 'http://wordpress.org/export/1.2/'
 *         // Get the value of the 'src' attribute
 *         $src = $processor->get_attribute( $ns, 'src' );
 *         // Set a new attribute in the same namespace
 *         $processor->set_attribute( $ns, 'alt', 'A cat' );
 *     }
 *
 * If a tag was found meeting your criteria then `next_tag()`
 * will return `true` and you can proceed to modify it. If it
 * returns `false`, it failed to find the tag and moved the cursor to the end of the file.
 *
 * Once the cursor reaches the end of the file the processor
 * is done and if you want to reach an earlier tag you will
 * need to recreate the processor and start over, as it's
 * unable to back up or move in reverse (except via bookmarks).
 *
 * #### Custom queries
 *
 * Sometimes it's necessary to further inspect an XML tag than
 * the query syntax here permits. In these cases one may further
 * inspect the search results using the read-only functions
 * provided by the processor or external state or variables.
 *
 * Example:
 *
 *     // Paint up to the first five `musician` or `actor` tags marked with the "jazzy" style.
 *     $remaining_count = 5;
 *     while ( $remaining_count > 0 && $processor->next_tag() ) {
 *         $tag = $processor->get_tag_local_name();
 *         if (
 *             ( 'musician' === $tag || 'actor' === $tag ) &&
 *             'jazzy' === $processor->get_attribute( '', 'data-style' )
 *         ) {
 *             $processor->set_attribute( '', 'theme-style', 'theme-style-everest-jazz' );
 *             $remaining_count--;
 *         }
 *     }
 *
 * `get_attribute()` will return `null` if the attribute wasn't present
 * on the tag when it was called. It may return `""` (the empty string)
 * in cases where the attribute was present but its value was empty.
 * For boolean attributes, those whose name is present but no value is
 * given, it will return `true` (the only way to set `false` for an
 * attribute is to remove it).
 *
 * #### When matching fails
 *
 * When `next_tag()` returns `false` it could mean different things:
 *
 *  - The requested tag wasn't found in the input document.
 *  - The input document ended in the middle of an XML syntax element.
 *
 * When a document ends in the middle of a syntax element it will pause
 * the processor. This is to make it possible in the future to extend the
 * input document and proceed - an important requirement for chunked
 * streaming parsing of a document.
 *
 * Example:
 *
 *     $processor = XMLProcessor::create_from_string( 'This <content is="a" partial="token' );
 *     false === $processor->next_tag();
 *
 * If a special element (see next section) is encountered but no closing tag
 * is found it will count as an incomplete tag. The parser will pause as if
 * the opening tag were incomplete.
 *
 * Example:
 *
 *     $processor = XMLProcessor::create_from_string( '<style>// there could be more styling to come' );
 *     false === $processor->next_tag();
 *
 *     $processor = XMLProcessor::create_from_string( '<style>// this is everything</style><content>' );
 *     true === $processor->next_tag( 'content' );
 *
 *
 * ### Modifying XML attributes for a found tag
 *
 * Once you've found the start of an opening tag you can modify
 * any number of the attributes on that tag. You can set a new
 * value for an attribute, remove the entire attribute, or do
 * nothing and move on to the next opening tag.
 *
 * Example:
 *
 *     if ( $processor->next_tag( 'user-group' ) ) {
 *         $processor->set_attribute( '', 'name', 'Content editors' );
 *         $processor->remove_attribute( '', 'data-test-id' );
 *     }
 *
 * If `set_attribute()` is called for an existing attribute it will
 * overwrite the existing value. Similarly, calling `remove_attribute()`
 * for a non-existing attribute has no effect on the document. Both
 * of these methods are safe to call without knowing if a given attribute
 * exists beforehand.
 *
 * #### Namespaced attribute example
 *
 *     $processor = XMLProcessor::create_from_string( '<root xmlns:wp="http://wordpress.org/export/1.2/"><image /></root>' );
 *
 *     $ns = 'http://wordpress.org/export/1.2/';
 *     if ( $processor->next_tag( 'image' ) ) {
 *         $processor->set_attribute( $ns, 'src', 'cat.jpg' );
 *     }
 *
 *     echo $processor->get_updated_xml();
 *     // <root xmlns:wp="http://wordpress.org/export/1.2/"><image wp:src="cat.jpg" /></root>
 *
 * ### Bookmarks
 *
 * While scanning through the input XML document it's possible to set
 * a named bookmark when a particular tag is found. Later on, after
 * continuing to scan other tags, it's possible to `seek` to one of
 * the set bookmarks and then proceed again from that point forward.
 *
 * Because bookmarks create processing overhead one should avoid
 * creating too many of them. As a rule, create only bookmarks
 * of known string literal names; avoid creating "mark_{$index}"
 * and so on. It's fine from a performance standpoint to create a
 * bookmark and update it frequently, such as within a loop.
 *
 *     $total_todos = 0;
 *     while ( $processor->next_tag( array( 'tag_name' => 'todo-list' ) ) ) {
 *         $processor->set_bookmark( 'list-start' );
 *         while ( $processor->next_tag() ) {
 *             if ( 'todo-list' === $processor->get_tag_local_name() && $processor->is_tag_closer() ) {
 *                 $processor->set_bookmark( 'list-end' );
 *                 $processor->seek( 'list-start' );
 *                 $processor->set_attribute( '', 'data-contained-todos', (string) $total_todos );
 *                 $total_todos = 0;
 *                 $processor->seek( 'list-end' );
 *                 break;
 *             }
 *
 *             if ( 'todo-item' === $processor->get_tag_local_name() && ! $processor->is_tag_closer() ) {
 *                 $total_todos++;
 *             }
 *         }
 *     }
 *
 * ## Tokens and finer-grained processing
 *
 * It's possible to scan through every lexical token in the
 * XML document using the `next_token()` function. This
 * alternative form takes no argument and provides no built-in
 * query syntax.
 *
 * Example:
 *
 *     $title = '(untitled)';
 *     $text  = '';
 *     while ( $processor->next_token() ) {
 *         switch ( $processor->get_token_name() ) {
 *             case '#text':
 *                 $text .= $processor->get_modifiable_text();
 *                 break;
 *
 *             case 'new-line':
 *                 $text .= "\n";
 *                 break;
 *
 *             case 'title':
 *                 $title = $processor->get_modifiable_text();
 *                 break;
 *         }
 *     }
 *     return trim( "# {$title}\n\n{$text}" );
 *
 * ### Tokens and _modifiable text_
 *
 * There are also non-elements which are void/self-closing in nature and contain
 * modifiable text that is part of that individual syntax token itself.
 *
 *  - `#text` nodes, whose entire token _is_ the modifiable text.
 *  - XML comments and tokens that become comments due to some syntax error. The
 *    text for these tokens is the portion of the comment inside of the syntax.
 *    E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
 *  - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
 *    `<![CDATA[some content]]>` the text is `"some content"`.
 *  - XML Processing instruction nodes like `<?xml __( "Like" ); ?>` (with restrictions [1]).
 *
 * [1]: XML requires "xml" as a processing instruction name. The Tag Processor captures the entire
 *      processing instruction as a single token up to the closing `?>`.
 *
 * ## Design and limitations
 *
 * The XMLProcessor is designed to linearly scan XML documents and tokenize
 * XML tags and their attributes. It's designed to do this as efficiently as
 * possible without compromising parsing integrity. Therefore it will be
 * slower than some methods of modifying XML, such as those incorporating
 * over-simplified PCRE patterns, but will not introduce the defects and
 * failures that those methods bring in, which lead to broken page renders
 * and often to security vulnerabilities. On the other hand, it will be faster
 * than full-blown XML parsers such as DOMDocument and use considerably
 * less memory. It requires a negligible memory overhead, enough to consider
 * it a zero-overhead system.
 *
 * The performance characteristics are maintained by avoiding tree construction.
 *
 * The XMLProcessor checks the most important aspects of XML integrity as it scans
 * through the document. It verifies that a single root element exists, that there are
 * no unclosed tags, and that each opener tag has a corresponding closer. It also
 * ensures no duplicate attributes exist on a single tag.
 *
 * At the same time, the XMLProcessor also skips expensive validation of XML entities
 * in the document. The processor will initially pass through invalid entity references
 * and only fail when the developer attempts to read their value. If that doesn't happen,
 * the invalid values will be left untouched in the final document.
 *
 * Most operations within the XMLProcessor are designed to minimize the difference
 * between an input and output document for any given change. For example, the
 * `set_attribute` and `remove_attribute` methods preserve whitespace and the attribute
 * ordering within the element definition. An exception to this rule is that all attribute
 * updates store their values as double-quoted strings, meaning that attributes on input with
 * single-quoted or unquoted values will appear in the output with double-quotes.
 *
 * ### Text Encoding
 *
 * The XMLProcessor assumes that the input XML document is encoded with a
 * UTF-8 encoding and will refuse to process documents that declare other encodings.
 *
 * ### Namespaces
 *
 * Namespaces are first-class citizens in the XMLProcessor. Methods such as `set_attribute()` and `remove_attribute()`
 * require the full namespace URI, not just the local name. The XML specification treats the local
 * name as mere syntactic sugar. The actual matching is always done on the fully qualified namespace name.
 *
 * Example:
 *
 *     $processor = XMLProcessor::create_from_string( '<root xmlns:wp="http://wordpress.org/export/1.2/"><wp:image src="cat.jpg" /></root>' );
 *     $processor->next_tag( 'image' );
 *     $local_name = $processor->get_tag_local_name(); // 'image'
 *     $ns         = $processor->get_tag_namespace(); // 'http://wordpress.org/export/1.2/'
 *     echo $processor->get_tag_namespace_and_local_name(); // '{http://wordpress.org/export/1.2/}image'
 *
 * #### Internal representation of names
 *
 * Internally, the XMLProcessor stores names using the following format:
 *
 *     {namespace}local_name
 *
 * It's safe, because the "{" and "}" bytes cannot be used in tag names or attribute names:
 *
 *     [4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
 *     [4a] NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
 *
 * @since WP_VERSION
 */
class XMLProcessor {
	/**
	 * The maximum number of bookmarks allowed to exist at
	 * any given time.
	 *
	 * @since WP_VERSION
	 * @var int
	 *
	 * @see XMLProcessor::set_bookmark()
	 */
	const MAX_BOOKMARKS = 10;

	/**
	 * Maximum number of times seek() can be called.
	 * Prevents accidental infinite loops.
	 *
	 * @since WP_VERSION
	 * @var int
	 *
	 * @see XMLProcessor::seek()
	 */
	const MAX_SEEK_OPS = 1000;

	/**
	 * The XML document to parse.
	 *
	 * @since WP_VERSION
	 * @var string
	 */
	public $xml;

	/**
	 * Specifies mode of operation of the parser at any given time.
	 *
	 * | State             | Meaning                                                              |
	 * | ------------------|----------------------------------------------------------------------|
	 * | *Ready*           | The parser is ready to run.                                          |
	 * | *Complete*        | There is nothing left to parse.                                      |
	 * | *Incomplete*      | The XML ended in the middle of a token; nothing more can be parsed.  |
	 * | *Matched tag*     | Found an XML tag; it's possible to modify its attributes.            |
	 * | *Text node*       | Found a #text node; this is plaintext and modifiable.                |
	 * | *CDATA node*      | Found a CDATA section; this is modifiable.                           |
	 * | *PI node*         | Found a processing instruction; this is modifiable.                  |
	 * | *XML declaration* | Found an XML declaration; this is modifiable.                        |
	 * | *Comment*         | Found a comment or bogus comment; this is modifiable.                |
	 *
	 * @since WP_VERSION
	 *
	 * @see XMLProcessor::STATE_READY
	 * @see XMLProcessor::STATE_COMPLETE
	 * @see XMLProcessor::STATE_INCOMPLETE_INPUT
	 * @see XMLProcessor::STATE_MATCHED_TAG
	 * @see XMLProcessor::STATE_TEXT_NODE
	 * @see XMLProcessor::STATE_CDATA_NODE
	 * @see XMLProcessor::STATE_PI_NODE
	 * @see XMLProcessor::STATE_XML_DECLARATION
	 * @see XMLProcessor::STATE_COMMENT
	 *
	 * @var string
	 */
	protected $parser_state = self::STATE_READY;

	/**
	 * Whether the input has been finished.
	 *
	 * @var bool
	 */
	protected $expecting_more_input = true;

	/**
	 * How many bytes from the current XML chunk have been read and parsed.
	 *
	 * This value points to the latest byte offset in the input document which
	 * has been already parsed. It is the internal cursor for the Tag Processor
	 * and updates while scanning through the XML tokens.
	 *
	 * @since WP_VERSION
	 * @var int
	 */
	public $bytes_already_parsed = 0;

	/**
	 * How many XML bytes from the original stream have already been removed
	 * from the memory.
	 *
	 * @since WP_VERSION
	 * @var int
	 */
	public $upstream_bytes_forgotten = 0;

	/**
	 * Byte offset in the current `$xml` string where the currently recognized token starts.
	 * `null` if no token is currently active.
	 *
	 * Example:
	 *
	 *     <content id="test">...
	 *     ^-- token_starts_at = 0
	 *
	 * @since WP_VERSION
	 *
	 * @var int|null
	 */
	protected $token_starts_at;

	/**
	 * Byte length of current token.
	 *
	 * Example:
	 *
	 *     <content id="test">...
	 *     0123456789012345678
	 *     - token length is 19 - 0 = 19
	 *
	 *     a <!-- comment --> is a token.
	 *     0123456789 123456789 123456789
	 *     - token length is 18 - 2 = 16
	 *
	 * @since WP_VERSION
	 *
	 * @var int|null
	 */
	private $token_length;

	/**
	 * Currently matched XML element object.
	 *
	 * @var XMLElement|null
	 */
	private $element;

	/**
	 * Byte offset in input document where current tag name starts.
	 *
	 * Example:
	 *
	 *     <content id="test">...
	 *     01234567
	 *      - tag name starts at 1
	 *
	 * @since WP_VERSION
	 *
	 * @var int|null
	 */
	private $tag_name_starts_at;

	/**
	 * Byte length of current tag name.
	 *
	 * Example:
	 *
	 *     <content id="test">...
	 *     01234567
	 *      ------- tag name length is 7
	 *
	 * @since WP_VERSION
	 *
	 * @var int|null
	 */
	private $tag_name_length;

	/**
	 * Byte offset into input document where current modifiable text starts.
	 *
	 * @since WP_VERSION
	 *
	 * @var int
	 */
	private $text_starts_at;

	/**
	 * Byte length of modifiable text.
	 *
	 * @since WP_VERSION
	 *
	 * @var int
	 */
	private $text_length;

	/**
	 * Whether the current tag is an opening tag, e.g. <content>, or a closing tag, e.g. </content>.
	 *
	 * @var bool
	 */
	private $is_closing_tag;

	/**
	 * Stores the error for why something failed, if it did.
	 *
	 * @see self::get_last_error
	 *
	 * @since WP_VERSION
	 *
	 * @var string|null
	 */
	protected $last_error = null;

	/**
	 * Stores context for why the parser bailed on unsupported XML, if it did.
	 *
	 * @see self::get_exception
	 *
	 * @var XMLUnsupportedException|null
	 */
	private $exception = null;

	/**
	 * Temporary index of attributes found within an XML tag, keyed by the qualified
	 * attribute name. It is only used during the initial attributes parsing phase and
	 * discarded once all the attributes have been parsed.
	 *
	 * @since WP_VERSION
	 * @var XMLAttributeToken[]
	 */
	private $qualified_attributes = array();

	/**
	 * Stores the attributes found within an XML tag, keyed by their namespace
	 * and local name combination.
	 *
	 * Example:
	 *
	 *     // Supposing the parser just finished parsing the wp:content tag:
	 *     // <channel xmlns:wp="http://wordpress.org/export/1.2/">
	 *     //     <wp:content wp:id="test-4" class="outline">
	 *     // </channel>
	 *     //
	 *     // Then, the attributes array would be:
	 *     $this->attributes = array(
	 *         '{http://wordpress.org/export/1.2/}id' => new XMLAttributeToken( 9, 6, 5, 14, 'wp', 'id' ),
	 *         'class' => new XMLAttributeToken( 23, 7, 17, 13, '', 'class', '' )
	 *     );
	 *
	 * @since WP_VERSION
	 * @var XMLAttributeToken[]
	 */
	private $attributes = array();

	/**
	 * Tracks a semantic location in the original XML which
	 * shifts with updates as they are applied to the document.
	 *
	 * @since WP_VERSION
	 * @var WP_HTML_Span[]
	 */
	protected $bookmarks = array();

	/**
	 * Lexical replacements to apply to input XML document.
	 *
	 * "Lexical" in this class refers to the part of this class which
	 * operates on pure text _as text_ and not as XML. There's a line
	 * between the public interface, with XML-semantic methods like
	 * `set_attribute` and `add_class`, and an internal state that tracks
	 * text offsets in the input document.
	 *
	 * When higher-level XML methods are called, those have to transform their
	 * operations (such as setting an attribute's value) into text diffing
	 * operations (such as replacing the sub-string from indices A to B with
	 * some given new string). These text-diffing operations are the lexical
	 * updates.
	 *
	 * As new higher-level methods are added they need to collapse their
	 * operations into these lower-level lexical updates since that's the
	 * Tag Processor's internal language of change. Any code which creates
	 * these lexical updates must ensure that they do not cross XML syntax
	 * boundaries, however, so these should never be exposed outside of this
	 * class or any classes which intentionally expand its functionality.
	 *
	 * These are enqueued while editing the document instead of being immediately
	 * applied to avoid processing overhead, string allocations, and string
	 * copies when applying many updates to a single document.
	 *
	 * Example:
	 *
	 *     // Replace an attribute stored with a new value, indices
	 *     // sourced from the lazily-parsed XML recognizer.
	 *     $start = $attributes['src']->start;
	 *     $length = $attributes['src']->length;
	 *     $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value );
	 *
	 *     // Correspondingly, something like this will appear in this array.
	 *     $lexical_updates = array(
	 *         WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
	 *     );
	 *
	 * @since WP_VERSION
	 * @var WP_HTML_Text_Replacement[]
	 */
	protected $lexical_updates = array();

	/**
	 * The Name from the DOCTYPE declaration.
	 *
	 * ```
	 * doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
	 *                               ^^^^
	 * ```
	 *
	 * @since WP_VERSION
	 * @var WP_HTML_Span|null
	 */
	protected $doctype_name = null;

	/**
	 * The system literal value from the DOCTYPE declaration.
	 *
	 * ```
	 * doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
	 * ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral
	 *                           ^^^^^^^^^^^^^
	 * ```
	 *
	 * Example:
	 *
	 *     <!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	 *
	 * In this example, the system_literal would be:
	 * "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
	 *
	 * @since WP_VERSION
	 * @var WP_HTML_Span|null
	 */
	protected $system_literal = null;

	/**
	 * The public identifier value from the DOCTYPE declaration.
	 *
	 * ```
	 * doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
	 * ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral
	 *                                                      ^^^^^^^^^^^^
	 * ```
	 *
	 * Example:
	 *
	 * ```
	 * <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	 * ```
	 *
	 * In this example, the pubid_literal would be:
	 * "-//W3C//DTD XHTML 1.0 Strict//EN"
	 *
	 * @since WP_VERSION
	 * @var WP_HTML_Span|null
	 */
	protected $pubid_literal = null;

	/**
	 * Memory budget for the processed XML.
	 *
	 * `append_bytes()` will flush the processed bytes whenever the XML buffer
	 * exceeds this budget. The lexical updates will be applied and the bookmarks
	 * will be reset.
	 *
	 * @var int
	 */
	protected $memory_budget = 1024 * 1024 * 1024;

	/**
	 * Tracks and limits `seek()` calls to prevent accidental infinite loops.
	 *
	 * @since WP_VERSION
	 * @var int
	 *
	 * @see XMLProcessor::seek()
	 */
	protected $seek_count = 0;

	/**
	 * Indicates the current parsing stage.
	 *
	 * A well-formed XML document has the following structure:
	 *
	 *     document ::= prolog element Misc*
	 *     prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
	 *     Misc     ::= Comment | PI | S
	 *
	 * There is exactly one element, called the root. No elements or text nodes may
	 * precede or follow it.
	 *
	 * See https://www.w3.org/TR/xml/#NT-document.
	 *
	 * | Stage           | Meaning                                                             |
	 * | ----------------|---------------------------------------------------------------------|
	 * | *Prolog*        | The parser is parsing the prolog.                                   |
	 * | *Element*       | The parser is parsing the root element.                             |
	 * | *Misc*          | The parser is parsing miscellaneous content.                        |
	 *
	 * @see XMLProcessor::IN_PROLOG_CONTEXT
	 * @see XMLProcessor::IN_ELEMENT_CONTEXT
	 * @see XMLProcessor::IN_MISC_CONTEXT
	 *
	 * @since WP_VERSION
	 * @var string
	 */
	protected $parser_context = self::IN_PROLOG_CONTEXT;

	/**
	 * Top-level namespaces for the currently parsed document.
	 *
	 * @var array
	 */
	private $document_namespaces;

	/**
	 * Tracks open elements and their namespaces while scanning XML.
	 *
	 * @var array
	 */
	private $stack_of_open_elements = array();

	public static function create_from_string( $xml, $cursor = null, $known_definite_encoding = 'UTF-8', $document_namespaces = array() ) {
		$processor = static::create_for_streaming( $xml, $cursor, $known_definite_encoding, $document_namespaces );
		// create_for_streaming() signals failure by returning false, not null.
		if ( false === $processor ) {
			return false;
		}
		$processor->input_finished();

		return $processor;
	}

	public static function create_for_streaming( $xml = '', $cursor = null, $known_definite_encoding = 'UTF-8', $document_namespaces = array() ) {
		if ( 'UTF-8' !== $known_definite_encoding ) {
			return false;
		}
		$processor = new XMLProcessor( $xml, $document_namespaces, self::CONSTRUCTOR_UNLOCK_CODE );
		if ( null !== $cursor && true !== $processor->initialize_from_cursor( $cursor ) ) {
			return false;
		}

		return $processor;
	}

	/**
	 * Returns a re-entrancy cursor: a string that can instruct a new XML
	 * Processor instance to continue parsing from the current location in the
	 * document.
	 *
	 * The only stable part of this API is the return type of string. The consumer
	 * of this method MUST NOT assume any specific structure of the returned
	 * string. It will change without a warning between WordPress releases.
	 *
	 * This is not a tell() API. No XML Processor method will accept the cursor
	 * to move to another location. The only way to use this cursor is creating
	 * a new XML Processor instance. If you need to move around the document, use
	 * `set_bookmark()` and `seek()`.
	 */
	public function get_reentrancy_cursor() {
		$stack_of_open_elements = array();
		foreach ( $this->stack_of_open_elements as $element ) {
			$stack_of_open_elements[] = $element->to_array();
		}

		return base64_encode( // phpcs:ignore WordPress.PHP.DiscouragedPHPFunctions.obfuscation_base64_encode
			json_encode(
				array(
					'is_finished' => $this->is_finished(),
					'upstream_bytes_forgotten' => $this->upstream_bytes_forgotten,
					'parser_context' => $this->parser_context,
					'stack_of_open_elements' => $stack_of_open_elements,
					'expecting_more_input' => $this->expecting_more_input,
					'document_namespaces' => $this->document_namespaces,
				)
			)
		);
	}

	/**
	 * Returns the byte offset in the input stream where the current token starts.
	 *
	 * You should probably not use this method.
	 *
	 * It only exists to allow resuming the input stream at the same offset where
	 * the XML parsing was finished. It will never expose any attribute's byte
	 * offset and no method in the XML processor API will ever accept the byte offset
	 * to move to another location. If you need to move around the document, use
	 * `set_bookmark()` and `seek()` instead.
	 */
	public function get_token_byte_offset_in_the_input_stream() {
		return $this->token_starts_at + $this->upstream_bytes_forgotten;
	}

	protected function initialize_from_cursor( $cursor ) {
		if ( ! is_string( $cursor ) ) {
			_doing_it_wrong( __METHOD__, 'Cursor must be a JSON-encoded string.', '1.0.0' );

			return false;
		}
		$cursor = base64_decode( $cursor ); // phpcs:ignore WordPress.PHP.DiscouragedPHPFunctions.obfuscation_base64_decode
		if ( false === $cursor ) {
			_doing_it_wrong( __METHOD__, 'Invalid cursor provided to initialize_from_cursor().', '1.0.0' );

			return false;
		}
		$cursor = json_decode( $cursor, true );
		// json_decode() returns null (not false) on malformed input, so check for an array.
		if ( ! is_array( $cursor ) ) {
			_doing_it_wrong( __METHOD__, 'Invalid cursor provided to initialize_from_cursor().', '1.0.0' );

			return false;
		}
		if ( $cursor['is_finished'] ) {
			$this->parser_state = self::STATE_COMPLETE;
		}
		// Assume the input stream will start from the last known byte offset.
		$this->bytes_already_parsed = 0;
		$this->upstream_bytes_forgotten = $cursor['upstream_bytes_forgotten'];
		$this->stack_of_open_elements = array();
		foreach ( $cursor['stack_of_open_elements'] as $element ) {
			array_push( $this->stack_of_open_elements, XMLElement::from_array( $element ) );
		}
		$this->document_namespaces = $cursor['document_namespaces'];
		$this->parser_context = $cursor['parser_context'];
		$this->expecting_more_input = $cursor['expecting_more_input'];

		return true;
	}

	/**
	 * Constructor.
	 *
	 * Do not use this method. Use the static creator methods instead.
	 *
	 * @access private
	 *
	 * @param string      $xml                                    XML to process.
	 * @param array       $document_namespaces                    Document namespaces.
	 * @param string|null $use_the_static_create_methods_instead This constructor should not be called manually.
	 *
	 * @see XMLProcessor::create_stream()
	 *
	 * @since 6.4.0
	 *
	 * @see XMLProcessor::create_fragment()
	 */
	protected function __construct( $xml, $document_namespaces = array(), $use_the_static_create_methods_instead = null ) {
		if ( self::CONSTRUCTOR_UNLOCK_CODE !== $use_the_static_create_methods_instead ) {
			_doing_it_wrong(
				__METHOD__,
				sprintf(
					/* translators: %s: XMLProcessor::create_fragment(). */
					__( 'Call %s to create an XML Processor instead of calling the constructor directly.' ),
					'<code>XMLProcessor::create_fragment()</code>'
				),
				'6.4.0'
			);
		}
		$this->xml = isset( $xml ) ? $xml : '';
		$this->document_namespaces = array_merge(
			$document_namespaces,
			// These initial namespaces cannot be overridden.
			array(
				'xml' => 'http://www.w3.org/XML/1998/namespace', // Predefined, cannot be unbound or changed.
				'xmlns' => 'http://www.w3.org/2000/xmlns/', // Reserved for xmlns attributes, not a real namespace for elements/attributes.
				'' => '', // Default namespace is initially empty (no namespace).
			)
		);
	}

	/**
	 * Wipes out the processed XML and appends the next chunk of XML to
	 * any remaining unprocessed XML.
	 *
	 * @param string $next_chunk XML to append.
	 */
	public function append_bytes( $next_chunk ) {
		if ( ! $this->expecting_more_input ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Cannot append bytes after the last input chunk was provided and input_finished() was called.' ),
				'WP_VERSION'
			);

			return false;
		}
		$this->xml .= $next_chunk;
		if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
			$this->parser_state = self::STATE_READY;
		}

		// Periodically flush the processed bytes to avoid high memory usage.
		if (
			null !== $this->memory_budget &&
			strlen( $this->xml ) > $this->memory_budget
		) {
			$this->flush_processed_xml();
		}

		return true;
	}

	/**
	 * Forgets the XML bytes that have been processed and are no longer needed to
	 * avoid high memory usage.
	 *
	 * @return string The flushed bytes.
	 */
	private function flush_processed_xml() {
		// Flush updates.
		$this->get_updated_xml();

		$unreferenced_bytes = $this->bytes_already_parsed;
		if ( null !== $this->token_starts_at ) {
			$unreferenced_bytes = min( $unreferenced_bytes, $this->token_starts_at );
		}

		$flushed_bytes = substr( $this->xml, 0, $unreferenced_bytes );
		$this->xml = substr( $this->xml, $unreferenced_bytes );
		$this->bookmarks = array();
		$this->lexical_updates = array();
		$this->seek_count = 0;
		$this->bytes_already_parsed -= $unreferenced_bytes;
		if ( null !== $this->token_starts_at ) {
			$this->token_starts_at -= $unreferenced_bytes;
		}
		if ( null !== $this->tag_name_starts_at ) {
			$this->tag_name_starts_at -= $unreferenced_bytes;
		}
		if ( null !== $this->text_starts_at ) {
			$this->text_starts_at -= $unreferenced_bytes;
		}
		$this->upstream_bytes_forgotten += $unreferenced_bytes;

		return $flushed_bytes;
	}

	/**
	 * Indicates that all the XML document bytes have been provided.
	 *
	 * After calling this method, the processor will emit errors where
	 * previously it would have entered the STATE_INCOMPLETE_INPUT state.
	 */
	public function input_finished() {
		$this->expecting_more_input = false;
		$this->parser_state = self::STATE_READY;
	}

	/**
	 * Indicates if the processor is expecting more data bytes.
	 * If not, the processor will expect the remaining XML bytes to form
	 * a valid document and will not stop on incomplete input.
	 *
	 * @return bool Whether the processor is expecting more data bytes.
	 */
	public function is_expecting_more_input() {
		return $this->expecting_more_input;
	}

	/**
	 * Internal method which finds the next token in the XML document.
	 *
	 * This method is a protected internal function which implements the logic for
	 * finding the next token in a document. It exists so that the parser can update
	 * its state without affecting the location of the cursor in the document and
	 * without triggering subclass methods for things like `next_token()`, e.g. when
	 * applying patches before searching for the next token.
	 *
	 * @return bool Whether a token was parsed.
* @since 6.5.0 * * @access private */ protected function parse_next_token() { $was_at = $this->bytes_already_parsed; $this->after_tag(); // Don't proceed if there's nothing more to scan. if ( self::STATE_COMPLETE === $this->parser_state || self::STATE_INCOMPLETE_INPUT === $this->parser_state || null !== $this->last_error ) { return false; } /* * The next step in the parsing loop determines the parsing state; * clear it so that state doesn't linger from the previous step. */ $this->parser_state = self::STATE_READY; if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) { if ( $this->expecting_more_input ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; } else { $this->parser_state = self::STATE_COMPLETE; } return false; } // Find the next tag if it exists. if ( false === $this->parse_next_tag() ) { if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { $this->bytes_already_parsed = $was_at; } return false; } if ( null !== $this->last_error ) { return false; } /* * For legacy reasons the rest of this function handles tags and their * attributes. If the processor has reached the end of the document * or if it matched any other token then it should return here to avoid * attempting to process tag-specific syntax. */ if ( self::STATE_INCOMPLETE_INPUT !== $this->parser_state && self::STATE_COMPLETE !== $this->parser_state && self::STATE_MATCHED_TAG !== $this->parser_state ) { return true; } if ( $this->is_closing_tag ) { $this->skip_whitespace(); } else { // Parse all of its attributes. while ( $this->parse_next_attribute() ) { continue; } } if ( null !== $this->last_error ) { return false; } if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { $this->bytes_already_parsed = $was_at; return false; } // Ensure that the tag closes before the end of the document. if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) { // Does this appropriately clear state (parsed attributes)? 
$this->mark_incomplete_input( 'Tag attributes were not closed before the end of the document.' ); $this->bytes_already_parsed = $was_at; return false; } $tag_ends_at = strpos( $this->xml, '>', $this->bytes_already_parsed ); if ( false === $tag_ends_at ) { $this->mark_incomplete_input( 'No > found at the end of a tag.' ); $this->bytes_already_parsed = $was_at; return false; } if ( $this->is_closing_tag && $tag_ends_at !== $this->bytes_already_parsed ) { $this->bail( 'Invalid closing tag encountered.', self::ERROR_SYNTAX ); return false; } $this->parser_state = self::STATE_MATCHED_TAG; $this->bytes_already_parsed = $tag_ends_at + 1; $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; /** * Resolve the namespaces defined in opening tags. */ if ( ! $this->is_closing_tag ) { /** * By default, inherit all namespaces from the parent element. */ $namespaces = $this->get_tag_namespaces_in_scope(); foreach ( $this->qualified_attributes as $attribute ) { /** * `xmlns` attribute is the default namespace * `xmlns:<prefix>` declares a namespace prefix scoped to the current element and its descendants * * @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#ns-decl */ if ( 'xmlns' === $attribute->qualified_name ) { $value = $this->get_qualified_attribute( $attribute->qualified_name ); // Update the default namespace. 
$namespaces[''] = $value; continue; } if ( 'xmlns' === $attribute->namespace_prefix ) { $value = $this->get_qualified_attribute( $attribute->qualified_name ); /** * @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#xmlReserved */ if ( 'xml' === $attribute->local_name && 'http://www.w3.org/XML/1998/namespace' !== $value ) { $this->bail( 'The `xml` namespace prefix is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace and must not be overridden.', self::ERROR_SYNTAX ); return false; } /** * @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#xmlReserved */ if ( 'xmlns' === $attribute->local_name ) { $this->bail( 'The `xmlns` namespace prefix must not be overridden.', self::ERROR_SYNTAX ); return false; } /** * The attribute value in a namespace declaration for a prefix MAY be empty. * This has the effect, within the scope of the declaration, of removing any * association of the prefix with a namespace name. Further declarations MAY * re-declare the prefix again. */ if ( '' === $value ) { unset( $namespaces[ $attribute->namespace_prefix ] ); continue; } $namespaces[ $attribute->local_name ] = $value; continue; } } /** * Confirm the tag name is valid with respect to XML namespaces. * * @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#Conformance */ $tag_name = $this->get_tag_name_qualified(); if ( false === $this->validate_qualified_name( $tag_name ) ) { return false; } list( $tag_namespace_prefix, $tag_local_name ) = $this->parse_qualified_name( $tag_name ); /** * Validate the element namespace. */ if ( ! array_key_exists( $tag_namespace_prefix, $namespaces ) ) { $this->bail( sprintf( 'Namespace prefix "%s" does not resolve to any namespace in the current element\'s scope.', $tag_namespace_prefix ), self::ERROR_SYNTAX ); } /** * Compute fully qualified attributes and assert: * * * All attributes have valid namespaces. * * No two attributes have the same (local name, namespace) pair. 
			 *
			 * @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#uniqAttrs
			 */
			$namespaced_attributes = array();
			foreach ( $this->qualified_attributes as $attribute ) {
				list( $attribute_namespace_prefix, $attribute_local_name ) = $this->parse_qualified_name( $attribute->qualified_name );
				if ( ! array_key_exists( $attribute_namespace_prefix, $namespaces ) ) {
					$this->bail(
						sprintf(
							'Attribute "%s" has an invalid namespace prefix "%s".',
							$attribute->qualified_name,
							$attribute_namespace_prefix
						),
						self::ERROR_SYNTAX
					);

					return false;
				}
				$namespace_reference = $attribute_namespace_prefix ? $namespaces[ $attribute_namespace_prefix ] : '';

				/**
				 * It looks suspicious but it is safe: $attribute_local_name is guaranteed
				 * to not contain curly braces at this point.
				 */
				$attribute_full_name = $namespace_reference ? '{' . $namespace_reference . '}' . $attribute_local_name : $attribute_local_name;
				if ( isset( $namespaced_attributes[ $attribute_full_name ] ) ) {
					$this->bail(
						sprintf(
							'Duplicate attribute "%s" with namespace "%s" found in the same element.',
							$attribute_local_name,
							$namespace_reference
						),
						self::ERROR_SYNTAX
					);

					return false;
				}
				$namespaced_attributes[ $attribute_full_name ] = $attribute;
				$attribute->namespace = $namespace_reference;
			}

			// Store attributes with their namespaces and discard the temporary
			// qualified attributes array.
			$this->attributes = $namespaced_attributes;
			$this->qualified_attributes = array();
			$this->element = new XMLElement( $tag_local_name, $tag_namespace_prefix, $namespaces[ $tag_namespace_prefix ], $namespaces );
			// Closers assume $this->element is the element created for the opener.
			// @see step_in_element.
		}

		/*
		 * Preserve the opening tag pointers, as these will be overwritten
		 * when finding the closing tag. They will be reset after finding
		 * the closing tag to point to the opening of the special atomic
		 * tag sequence.
		 */
		$tag_name_starts_at = $this->tag_name_starts_at;
		$tag_name_length = $this->tag_name_length;
		$tag_ends_at = $this->token_starts_at + $this->token_length;
		$attributes = $this->qualified_attributes;

		/*
		 * The values here look like they reference the opening tag but they reference
		 * the closing tag instead. This is why the opening tag values were stored
		 * above in a variable. It reads confusingly here, but that's because the
		 * functions that skip the contents have moved all the internal cursors past
		 * the inner content of the tag.
		 */
		$this->token_starts_at = $was_at;
		$this->token_length = $this->bytes_already_parsed - $this->token_starts_at;
		$this->text_starts_at = $tag_ends_at;
		$this->text_length = $this->tag_name_starts_at - $this->text_starts_at;
		$this->tag_name_starts_at = $tag_name_starts_at;
		$this->tag_name_length = $tag_name_length;
		$this->qualified_attributes = $attributes;

		return true;
	}

	private function get_tag_namespaces_in_scope() {
		$top = $this->top_element();
		if ( null === $top ) {
			// Namespaces defined by default in every XML document.
			return $this->document_namespaces;
		}

		return $top->namespaces_in_scope;
	}

	/**
	 * Returns the namespace prefix of the matched tag or, when the $xml_namespace
	 * argument is given, the prefix bound to the requested fully-qualified namespace.
	 *
	 * Examples:
	 *
	 *     $p = new XMLProcessor( '<wp:content xmlns:wp="http://wordpress.org/export/1.2/">Test</wp:content>' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag_namespace_prefix() === 'wp';
	 *
	 *     $p = new XMLProcessor( '
	 *         <wp:content
	 *             xmlns:xhtml="http://www.w3.org/1999/xhtml"
	 *             xmlns:wp="http://wordpress.org/export/1.2/"
	 *         >
	 *             Test
	 *         </wp:content>
	 *     ' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag_namespace_prefix('http://wordpress.org/export/1.2/') === 'wp';
	 *
	 * @internal
	 * @param string|null $xml_namespace Fully-qualified namespace to return the prefix for.
	 * @return string|null|false The namespace prefix of the matched tag, null if no tag is matched,
	 *                           or false if the requested namespace is not in scope.
	 */
	private function get_tag_namespace_prefix( $xml_namespace = null ) {
		if ( null === $xml_namespace ) {
			if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
				return null;
			}

			return $this->element->namespace_prefix;
		} else {
			$namespaces_in_scope = $this->get_tag_namespaces_in_scope();
			foreach ( $namespaces_in_scope as $prefix => $uri ) {
				if ( $uri === $xml_namespace ) {
					return $prefix;
				}
			}

			return false;
		}
	}

	/**
	 * Returns the top XMLElement on the stack without removing it.
	 *
	 * @return XMLElement|null Returns the top element, or null if stack is empty.
	 */
	private function top_element() {
		if ( empty( $this->stack_of_open_elements ) ) {
			return null;
		}

		return $this->stack_of_open_elements[ count( $this->stack_of_open_elements ) - 1 ];
	}

	/**
	 * Whether the processor paused because the input XML document ended
	 * in the middle of a syntax element, such as in the middle of a tag.
	 *
	 * Example:
	 *
	 *     $processor = new XMLProcessor( '<input type="text" value="Th' );
	 *     false === $processor->next_tag();
	 *     true === $processor->is_paused_at_incomplete_input();
	 *
	 * @return bool Whether the parse paused at the start of an incomplete token.
	 * @since WP_VERSION
	 */
	public function is_paused_at_incomplete_input(): bool {
		return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
	}

	/**
	 * Whether the processor finished processing.
	 *
	 * @return bool Whether the processor finished processing.
	 * @since WP_VERSION
	 */
	public function is_finished() {
		return self::STATE_COMPLETE === $this->parser_state;
	}

	/**
	 * Sets a bookmark in the XML document.
	 *
	 * Bookmarks represent specific places or tokens in the XML
	 * document, such as a tag opener or closer. When applying
	 * edits to a document, such as setting an attribute, the
	 * text offsets of that token may shift; the bookmark is
	 * kept updated with those shifts and remains stable unless
	 * the entire span of text in which the token sits is removed.
	 *
	 * Release bookmarks when they are no longer needed.
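	 *
	 * A minimal sketch of that lifecycle. It assumes `$xml` holds a well-formed
	 * document and that `seek()` returns to a previously set bookmark, as the
	 * longer example in this docblock suggests; the bookmark name is arbitrary:
	 *
	 *     $p = XMLProcessor::create_fragment( $xml );
	 *     if ( $p->next_tag() && $p->set_bookmark( 'first-tag' ) ) {
	 *         // ...scan ahead, then return to the bookmarked tag.
	 *         $p->seek( 'first-tag' );
	 *         $p->release_bookmark( 'first-tag' );
	 *     }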
* * Example: * * <main><h2>Surprising fact you may not know!</h2></main> * ^ ^ * \-|-- this `H2` opener bookmark tracks the token * * <main class="clickbait"><h2>Surprising fact you may no… * ^ ^ * \-|-- it shifts with edits * * Bookmarks provide the ability to seek to a previously-scanned * place in the XML document. This avoids the need to re-scan * the entire document. * * Example: * * <ul><li>One</li><li>Two</li><li>Three</li></ul> * ^^^^ * want to note this last item * * $p = new XMLProcessor( $xml ); * $in_list = false; * while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) { * if ( 'UL' === $p->get_qualified_tag() ) { * if ( $p->is_tag_closer() ) { * $in_list = false; * $p->set_bookmark( 'resume' ); * if ( $p->seek( 'last-li' ) ) { * $p->add_class( 'last-li' ); * } * $p->seek( 'resume' ); * $p->release_bookmark( 'last-li' ); * $p->release_bookmark( 'resume' ); * } else { * $in_list = true; * } * } * * if ( 'LI' === $p->get_qualified_tag() ) { * $p->set_bookmark( 'last-li' ); * } * } * * Bookmarks intentionally hide the internal string offsets * to which they refer. They are maintained internally as * updates are applied to the XML document and therefore * retain their "position" - the location to which they * originally pointed. The inability to use bookmarks with * functions like `substr` is therefore intentional to guard * against accidentally breaking the XML. * * Because bookmarks allocate memory and require processing * for every applied update, they are limited and require * a name. They should not be created with programmatically-made * names, such as "li_{$index}" with some loop. As a general * rule they should only be created with string-literal names * like "start-of-section" or "last-paragraph". * * Bookmarks are a powerful tool to enable complicated behavior. 
* Consider double-checking that you need this tool if you are * reaching for it, as inappropriate use could lead to broken * XML structure or unwanted processing overhead. * * @param string $name Identifies this particular bookmark. * * @return bool Whether the bookmark was successfully created. * @since WP_VERSION */ public function set_bookmark( $name ) { // It only makes sense to set a bookmark if the parser has paused on a concrete token. if ( self::STATE_COMPLETE === $this->parser_state || self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) { _doing_it_wrong( __METHOD__, __( 'Too many bookmarks: cannot create any more.' ), 'WP_VERSION' ); return false; } $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length ); return true; } /** * Removes a bookmark that is no longer needed. * * Releasing a bookmark frees up the small * performance overhead it requires. * * @param string $name Name of the bookmark to remove. * * @return bool Whether the bookmark already existed before removal. */ public function release_bookmark( $name ) { if ( ! array_key_exists( $name, $this->bookmarks ) ) { return false; } unset( $this->bookmarks[ $name ] ); return true; } /** * Returns the last error, if any. * * Various situations lead to parsing failure but this class will * return `false` in all those cases. To determine why something * failed it's possible to request the last error. This can be * helpful to know to distinguish whether a given tag couldn't * be found or if content in the document caused the processor * to give up and abort processing. * * Example * * $processor = XMLProcessor::create_fragment( '<content invalid-attr></content>' ); * false === $processor->next_tag(); * XMLProcessor::ERROR_SYNTAX === $processor->get_last_error(); * * @return string|null The last error, if one exists, otherwise null. 
* @see self::ERROR_UNSUPPORTED * @see self::ERROR_EXCEEDED_MAX_BOOKMARKS * * @since WP_VERSION */ public function get_last_error(): ?string { return $this->last_error; } /** * Finds the next element matching the $query. * * This doesn't currently have a way to represent non-tags and doesn't process * semantic rules for text nodes. For access to the raw tokens consider using * XMLProcessor instead. * * @param array|string|null $query_or_ns { * Optional. Which tag name to find, having which class, etc. Default is to find any tag. * * @type string|null $tag_name Which tag to find, or `null` for "any tag." * @type int|null $match_offset Find the Nth tag matching all search criteria. * 1 for "first" tag, 3 for "third," etc. * Defaults to first tag. * @type string[] $breadcrumbs DOM sub-path at which element is found, e.g. `array( 'FIGURE', 'IMG' )`. * May also contain the wildcard `*` which matches a single element, e.g. `array( 'SECTION', '*' )`. * } * @return bool Whether a tag was matched. * @since WP_VERSION */ public function next_tag( $query_or_ns = null, $null_or_local_name = null ) { if ( null === $query_or_ns && null === $null_or_local_name ) { while ( $this->step() ) { if ( '#tag' !== $this->get_token_type() ) { continue; } if ( ! $this->is_tag_closer() ) { return true; } } return false; } if ( is_string( $query_or_ns ) ) { if ( is_string( $null_or_local_name ) ) { $query = array( 'breadcrumbs' => array( array( $query_or_ns, $null_or_local_name ) ) ); } else { $query = array( 'breadcrumbs' => array( array( '', $query_or_ns ) ) ); } } else { $query = $query_or_ns; } if ( ! is_array( $query ) ) { _doing_it_wrong( __METHOD__, __( 'Please pass a query array to this function.' ), 'WP_VERSION' ); return false; } if ( array( 0, 1 ) === array_keys( $query ) && is_string( $query[0] ) && is_string( $query[1] ) ) { $query = array( 'breadcrumbs' => array( $query ) ); } if ( ! 
( array_key_exists( 'breadcrumbs', $query ) && is_array( $query['breadcrumbs'] ) ) ) { while ( $this->step() ) { if ( '#tag' !== $this->get_token_type() ) { continue; } if ( ! $this->is_tag_closer() ) { return true; } } return false; } if ( isset( $query['tag_closers'] ) && 'visit' === $query['tag_closers'] ) { _doing_it_wrong( __METHOD__, __( 'Cannot visit tag closers in XML Processor.' ), 'WP_VERSION' ); return false; } $namespaced_breadcrumbs = array(); foreach ( $query['breadcrumbs'] as $breadcrumb ) { if ( is_array( $breadcrumb ) && 2 === count( $breadcrumb ) ) { $namespaced_breadcrumbs[] = $breadcrumb; } elseif ( is_string( $breadcrumb ) ) { $namespaced_breadcrumbs[] = array( '', $breadcrumb ); } else { _doing_it_wrong( __METHOD__, __( 'Breadcrumbs must be an array of strings or two-tuples of (namespace, local name).' ), 'WP_VERSION' ); } } $breadcrumbs = $namespaced_breadcrumbs; $match_offset = isset( $query['match_offset'] ) ? (int) $query['match_offset'] : 1; while ( $match_offset > 0 && $this->step() ) { if ( '#tag' !== $this->get_token_type() ) { continue; } if ( $this->matches_breadcrumbs( $breadcrumbs ) && 0 === --$match_offset ) { return true; } } return false; } /** * Parses the next tag. * * This will find and start parsing the next tag, including * the opening `<`, the potential closer `/`, and the tag * name. It does not parse the attributes or scan to the * closing `>`; these are left for other methods. * * @return bool Whether a tag was found before the end of the document. 
* @since WP_VERSION */ private function parse_next_tag() { $this->after_tag(); $xml = $this->xml; $doc_length = strlen( $xml ); $was_at = $this->bytes_already_parsed; $at = $was_at; while ( false !== $at && $at < $doc_length ) { $at = strpos( $xml, '<', $at ); if ( false === $at ) { break; } if ( $at > $was_at ) { $this->parser_state = self::STATE_TEXT_NODE; $this->token_starts_at = $was_at; $this->token_length = $at - $was_at; $this->text_starts_at = $was_at; $this->text_length = $this->token_length; $this->bytes_already_parsed = $at; return true; } $this->token_starts_at = $at; if ( $at + 1 < $doc_length && '/' === $this->xml[ $at + 1 ] ) { $this->is_closing_tag = true; ++$at; } else { $this->is_closing_tag = false; } if ( $at + 1 >= $doc_length ) { $this->mark_incomplete_input(); return false; } /* * XML tag names are defined by the same `Name` grammar rule as attribute * names. * * Reference: * * https://www.w3.org/TR/xml/#NT-STag * * https://www.w3.org/TR/xml/#NT-Name */ $tag_name_length = $this->parse_name( $at + 1 ); if ( false === $tag_name_length ) { return false; } if ( $tag_name_length > 0 ) { ++$at; $this->parser_state = self::STATE_MATCHED_TAG; $this->tag_name_starts_at = $at; $this->tag_name_length = $tag_name_length; $this->token_length = $this->tag_name_length; $this->bytes_already_parsed = $at + $this->tag_name_length; return true; } /* * Abort if no tag is found before the end of * the document. There is nothing left to parse. */ if ( $at + 1 >= $doc_length ) { $this->mark_incomplete_input( 'No more tags found before the end of the document.' ); return false; } /* * `<!` indicates one of a few possible constructs: */ if ( ! $this->is_closing_tag && '!' === $xml[ $at + 1 ] ) { /* * `<!--` mark a beginning of a comment. * https://www.w3.org/TR/xml/#sec-comments */ if ( $doc_length > $at + 3 && '-' === $xml[ $at + 2 ] && '-' === $xml[ $at + 3 ] ) { $closer_at = $at + 4; // If it's not possible to close the comment then there is nothing more to scan. 
if ( $doc_length <= $closer_at ) { $this->mark_incomplete_input( 'The document ends with a comment opener.' ); return false; } /* * Comments may only be closed by a --> sequence. */ --$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping. while ( ++$closer_at < $doc_length ) { $closer_at = strpos( $xml, '--', $closer_at ); if ( false === $closer_at || $closer_at + 2 === $doc_length ) { $this->mark_incomplete_input( 'Unclosed comment.' ); return false; } /* * The string " -- " (double-hyphen) must not occur within comments * See https://www.w3.org/TR/xml/#sec-comments */ if ( '>' !== $xml[ $closer_at + 2 ] ) { $this->bail( 'Invalid comment syntax', self::ERROR_SYNTAX ); } $this->parser_state = self::STATE_COMMENT; $this->token_length = $closer_at + 3 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 4; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 3; return true; } } /* * Identify CDATA sections. * * Within a CDATA section, everything until the ]]> string is treated * as data, not markup. Left angle brackets and ampersands may occur in * their literal form; they need not (and cannot) be escaped using "<" * and "&". CDATA sections cannot nest. 
* * See https://www.w3.org/TR/xml11.xml/#sec-cdata-sect */ if ( $doc_length > $this->token_starts_at + 8 && '[' === $xml[ $this->token_starts_at + 2 ] && 'C' === $xml[ $this->token_starts_at + 3 ] && 'D' === $xml[ $this->token_starts_at + 4 ] && 'A' === $xml[ $this->token_starts_at + 5 ] && 'T' === $xml[ $this->token_starts_at + 6 ] && 'A' === $xml[ $this->token_starts_at + 7 ] && '[' === $xml[ $this->token_starts_at + 8 ] ) { $closer_at = strpos( $xml, ']]>', $at + 1 ); if ( false === $closer_at ) { $this->mark_incomplete_input( 'Unclosed CDATA section' ); return false; } $this->parser_state = self::STATE_CDATA_NODE; $this->token_length = $closer_at + 1 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 9; $this->text_length = $closer_at - $this->text_starts_at; $this->bytes_already_parsed = $closer_at + 3; return true; } /* * Identify DOCTYPE nodes. * * doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' * ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral * SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'") * PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'" * PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%] * See https://www.w3.org/TR/xml11.html/#dtd */ if ( $doc_length > $this->token_starts_at + 8 && 'D' === $xml[ $at + 2 ] && 'O' === $xml[ $at + 3 ] && 'C' === $xml[ $at + 4 ] && 'T' === $xml[ $at + 5 ] && 'Y' === $xml[ $at + 6 ] && 'P' === $xml[ $at + 7 ] && 'E' === $xml[ $at + 8 ] ) { $at += 9; // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); if ( $doc_length <= $at ) { $this->mark_incomplete_input( 'Unclosed DOCTYPE declaration.' ); return false; } // @TODO: Expose the "name" value instead of skipping it like that. $name_length = $this->parse_name( $at ); if ( false === $name_length ) { $this->mark_incomplete_input( 'Unclosed DOCTYPE declaration.' 
); return false; } $this->doctype_name = new WP_HTML_Span( $at, $name_length ); $at += $name_length; // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); if ( $doc_length <= $at ) { $this->mark_incomplete_input( 'Unclosed DOCTYPE declaration.' ); return false; } // Check for SYSTEM or PUBLIC identifiers. if ( $doc_length > $at + 6 && 'S' === $this->xml[ $at ] && 'Y' === $this->xml[ $at + 1 ] && 'S' === $this->xml[ $at + 2 ] && 'T' === $this->xml[ $at + 3 ] && 'E' === $this->xml[ $at + 4 ] && 'M' === $this->xml[ $at + 5 ] ) { $at += 6; // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); // Parse the SystemLiteral token. $quoted_string_length = $this->parse_quoted_string( $at ); if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { $this->mark_incomplete_input( 'Unclosed SYSTEM literal.' ); return false; } $this->system_literal = new WP_HTML_Span( // Start after the opening quote. $at + 1, // Exclude the closing quote. $quoted_string_length - 2 ); $at += $quoted_string_length; } elseif ( $doc_length > $at + 6 && 'P' === $this->xml[ $at ] && 'U' === $this->xml[ $at + 1 ] && 'B' === $this->xml[ $at + 2 ] && 'L' === $this->xml[ $at + 3 ] && 'I' === $this->xml[ $at + 4 ] && 'C' === $this->xml[ $at + 5 ] ) { $at += 6; // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); /* * PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'" * PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%] */ $opening_quote_char = $this->xml[ $at ]; if ( "'" !== $opening_quote_char && '"' !== $opening_quote_char ) { $this->bail( 'Unsupported DOCTYPE syntax. PUBLIC identifiers must be enclosed in double quotes.' 
); return false; } $pubid_char = " \r\nabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-()+,./:=?;!*#@\$_%"; if ( "'" === $opening_quote_char ) { $pubid_char .= "'"; } $pubid_literal_length = strspn( $this->xml, $pubid_char, $at + 1 ); $this->pubid_literal = new WP_HTML_Span( $at + 1, $pubid_literal_length ); $at += $pubid_literal_length + 2; // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); // Parse the SystemLiteral token. $quoted_string_length = $this->parse_quoted_string( $at ); if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { $this->mark_incomplete_input( 'Unclosed SYSTEM literal.' ); return false; } $this->system_literal = new WP_HTML_Span( // Start after the opening quote. $at + 1, // Exclude the closing quote. $quoted_string_length - 2 ); $at += $quoted_string_length; } elseif ( '[' === $this->xml[ $at ] ) { $this->bail( 'Inline entity declarations are not yet supported in DOCTYPE declarations.', self::ERROR_SYNTAX ); } // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); if ( '>' !== $this->xml[ $at ] ) { $this->bail( sprintf( 'Syntax error in DOCTYPE declaration. Unexpected character "%s" at position %d.', $this->xml[ $at ], $at ), self::ERROR_SYNTAX ); } $closer_at = $at; $this->parser_state = self::STATE_DOCTYPE_NODE; $this->token_length = $closer_at + 1 - $this->token_starts_at; $this->bytes_already_parsed = $closer_at + 1; return true; } /* * Anything else here is either unsupported at this point or invalid * syntax. See the class-level @TODO annotations for more information. */ $this->mark_incomplete_input( 'Unsupported <! syntax.' ); return false; } /* * An `<?xml` token at the beginning of the document marks a start of an * xml declaration. * See https://www.w3.org/TR/xml/#sec-prolog-dtd */ if ( 0 === $at && 0 === $this->upstream_bytes_forgotten && ! $this->is_closing_tag && '?' 
=== $xml[ $at + 1 ] && 'x' === $xml[ $at + 2 ] && 'm' === $xml[ $at + 3 ] && 'l' === $xml[ $at + 4 ] ) { // Setting the parser state early for the get_attribute_by_qualified_name() calls later in this // branch. $this->parser_state = self::STATE_XML_DECLARATION; $at += 5; // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); $this->bytes_already_parsed = $at; /* * Reuse parse_next_attribute() to parse the XML declaration attributes. * Technically, only "version", "encoding", and "standalone" are accepted * and, unlike regular tag attributes, their values can contain any character * other than the opening quote. However, the "<" and "&" characters are very * unlikely to be encountered and cause trouble, so this code path liberally * does not provide a dedicated parsing logic. */ while ( false !== $this->parse_next_attribute() ) { $this->skip_whitespace(); // Parse until the XML declaration closer. if ( '?' === $xml[ $this->bytes_already_parsed ] ) { break; } } if ( null !== $this->last_error ) { return false; } foreach ( $this->qualified_attributes as $name => $attribute ) { if ( 'version' !== $name && 'encoding' !== $name && 'standalone' !== $name ) { $this->bail( 'Invalid attribute found in XML declaration.', self::ERROR_SYNTAX ); return false; } } if ( '1.0' !== $this->get_qualified_attribute( 'version' ) ) { $this->bail( 'Unsupported XML version declared', self::ERROR_UNSUPPORTED ); return false; } /** * Standalone XML documents have no external dependencies, * including predefined entities like ` ` and `©`. * * See https://www.w3.org/TR/xml/#sec-predefined-ent. 
*/ if ( null !== $this->get_qualified_attribute( 'encoding' ) && 'UTF-8' !== strtoupper( $this->get_qualified_attribute( 'encoding' ) ) ) { $this->bail( 'Unsupported XML encoding declared, only UTF-8 is supported.', self::ERROR_UNSUPPORTED ); return false; } if ( null !== $this->get_qualified_attribute( 'standalone' ) && 'YES' !== strtoupper( $this->get_qualified_attribute( 'standalone' ) ) ) { $this->bail( 'Standalone XML documents are not supported.', self::ERROR_UNSUPPORTED ); return false; } $at = $this->bytes_already_parsed; // Skip whitespace. $at += strspn( $this->xml, " \t\f\r\n", $at ); // Consume the closer. if ( ! ( $at + 2 <= $doc_length && '?' === $xml[ $at ] && '>' === $xml[ $at + 1 ] ) ) { $this->bail( 'XML declaration closer not found.', self::ERROR_SYNTAX ); return false; } $this->token_length = $at + 2 - $this->token_starts_at; $this->text_starts_at = $this->token_starts_at + 2; $this->text_length = $at - $this->text_starts_at; $this->bytes_already_parsed = $at + 2; $this->parser_state = self::STATE_XML_DECLARATION; // Processing instructions don't have namespaces. We can just // copy the qualified attributes to the attributes array without // resolving anything. $this->attributes = $this->qualified_attributes; $this->qualified_attributes = array(); return true; } /* * `<?` denotes a processing instruction. * See https://www.w3.org/TR/xml/#sec-pi */ if ( ! $this->is_closing_tag && '?' === $xml[ $at + 1 ] ) { if ( $at + 4 >= $doc_length ) { $this->mark_incomplete_input(); return false; } if ( ! ( ( 'x' === $xml[ $at + 2 ] || 'X' === $xml[ $at + 2 ] ) && ( 'm' === $xml[ $at + 3 ] || 'M' === $xml[ $at + 3 ] ) && ( 'l' === $xml[ $at + 4 ] || 'L' === $xml[ $at + 4 ] ) ) ) { $this->bail( 'Invalid processing instruction target.', self::ERROR_SYNTAX ); } $at += 5; // Skip whitespace. $this->skip_whitespace(); /* * Find the closer. 
				 *
				 * We could, at this point, only consume the bytes allowed by the specification, that is:
				 *
				 * [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
				 *
				 * However, that would require running a slow regular-expression engine for, seemingly,
				 * little benefit. For now, we are going to pretend that all bytes are allowed until the
				 * closing ?> is found. Some failures may pass unnoticed. That may not be a problem in practice,
				 * but if it is then this code path will require a stricter implementation.
				 */
				$closer_at = strpos( $xml, '?>', $at );
				if ( false === $closer_at ) {
					$this->mark_incomplete_input();

					return false;
				}

				$this->parser_state = self::STATE_PI_NODE;
				$this->token_length = $closer_at + 5 - $this->token_starts_at;
				$this->text_starts_at = $this->token_starts_at + 5;
				$this->text_length = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 2;

				return true;
			}

			++$at;
		}

		// There are no more tag openers and we're not expecting more data,
		// so this must be a trailing text node.
		if ( ! $this->expecting_more_input ) {
			$this->parser_state = self::STATE_TEXT_NODE;
			$this->token_starts_at = $was_at;
			$this->token_length = $doc_length - $was_at;
			$this->text_starts_at = $was_at;
			$this->text_length = $doc_length - $was_at;
			$this->bytes_already_parsed = $doc_length;

			return true;
		}

		/*
		 * This does not imply an incomplete parse; it indicates that there
		 * can be nothing left in the document other than a #text node.
		 */
		$this->mark_incomplete_input();
		$this->token_starts_at = $was_at;
		$this->token_length = $doc_length - $was_at;
		$this->text_starts_at = $was_at;
		$this->text_length = $doc_length - $was_at;

		return false;
	}

	/**
	 * Parses the next attribute.
	 *
	 * @return bool Whether an attribute was found before the end of the document.
	 * @since WP_VERSION
	 */
	private function parse_next_attribute() {
		// Skip whitespace and slashes.
		$this->bytes_already_parsed += strspn( $this->xml, " \t\f\r\n/", $this->bytes_already_parsed );
		if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) {
			$this->mark_incomplete_input();

			return false;
		}

		// No more attributes to parse.
		if ( '>' === $this->xml[ $this->bytes_already_parsed ] ) {
			return false;
		}

		$attribute_start = $this->bytes_already_parsed;
		$attribute_qname_length = $this->parse_name( $this->bytes_already_parsed );
		if ( 0 === $attribute_qname_length ) {
			$this->bail( 'Invalid attribute name encountered.', self::ERROR_SYNTAX );
		}
		$this->bytes_already_parsed += $attribute_qname_length;
		$attribute_qname = substr( $this->xml, $attribute_start, $attribute_qname_length );
		$this->skip_whitespace();

		// Parse attribute value.
		++$this->bytes_already_parsed;
		$this->skip_whitespace();
		if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) {
			$this->mark_incomplete_input();

			return false;
		}

		switch ( $this->xml[ $this->bytes_already_parsed ] ) {
			case "'":
			case '"':
				$quote = $this->xml[ $this->bytes_already_parsed ];
				$value_start = $this->bytes_already_parsed + 1;
				/**
				 * XML attributes cannot contain the characters "<" or "&".
				 *
				 * This only checks for "<" because it's reasonably fast.
				 * Ampersands are actually allowed when used as the start
				 * of an entity reference, but enforcing that would require
				 * an expensive and complex check. It doesn't seem to be
				 * worth it.
				 *
				 * @TODO: Discuss enforcing or abandoning the ampersand rule
				 *        and document the rationale.
				 */
				$value_length = strcspn( $this->xml, "<$quote", $value_start );
				$attribute_end = $value_start + $value_length + 1;

				if ( $attribute_end - 1 >= strlen( $this->xml ) ) {
					$this->mark_incomplete_input();

					return false;
				}

				if ( $this->xml[ $attribute_end - 1 ] !== $quote ) {
					$this->bail( 'A disallowed character encountered in an attribute value (either < or &).', self::ERROR_SYNTAX );
				}
				$this->bytes_already_parsed = $attribute_end;
				break;

			default:
				$this->bail( 'Unquoted attribute value encountered.', self::ERROR_SYNTAX );
		}

		if ( $attribute_end >= strlen( $this->xml ) ) {
			$this->mark_incomplete_input();

			return false;
		}

		if ( $this->is_closing_tag ) {
			return true;
		}

		if ( array_key_exists( $attribute_qname, $this->qualified_attributes ) ) {
			$this->bail( 'Duplicate attribute found in an XML tag.', self::ERROR_SYNTAX );
		}

		/**
		 * Confirm the attribute name is valid with respect to XML namespaces.
		 *
		 * @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#Conformance
		 */
		if ( false === $this->validate_qualified_name( $attribute_qname ) ) {
			return false;
		}

		/**
		 * We must compute the namespace prefix and local name for each attribute
		 * to assert there are no duplicate (local name, namespace) pairs in any
		 * element. Note we must still keep track of string indices to support
		 * replacements.
		 */
		list( $namespace_prefix, $local_name ) = $this->parse_qualified_name( $attribute_qname );
		$this->qualified_attributes[ $attribute_qname ] = new XMLAttributeToken(
			$value_start,
			$value_length,
			$attribute_start,
			$attribute_end - $attribute_start,
			$namespace_prefix,
			$local_name
			/**
			 * The full namespace is resolved in parse_next_token() once
			 * all the attributes have been consumed.
			 */
		);

		return true;
	}

	private function parse_quoted_string( $at = null ) {
		if ( null === $at ) {
			$at = $this->bytes_already_parsed;
		}
		$quote = $this->xml[ $at ];
		if ( "'" !== $quote && '"' !== $quote ) {
			$this->bail( 'Invalid quote character encountered in an attribute value.', self::ERROR_SYNTAX );
		}
		$value_length = strcspn( $this->xml, $quote, $at + 1 );
		if ( $at + $value_length + 1 >= strlen( $this->xml ) ) {
			$this->mark_incomplete_input();

			return false;
		}
		if ( $this->xml[ $at + $value_length + 1 ] !== $quote ) {
			$this->bail( 'A disallowed character encountered in an attribute value (either < or &).', self::ERROR_SYNTAX );
		}

		return $value_length + 2;
	}

	/**
	 * Move the internal cursor past any immediate successive whitespace.
	 *
	 * @since WP_VERSION
	 */
	private function skip_whitespace() {
		$this->bytes_already_parsed += strspn( $this->xml, " \t\f\r\n", $this->bytes_already_parsed );
	}

	/**
	 * Parses a Name token starting at $offset
	 *
	 * Name ::= NameStartChar (NameChar)*
	 *
	 * @param int $offset
	 *
	 * @return int
	 */
	private function parse_name( $offset ) {
		static $i = 0;
		$name_byte_length = 0;
		while ( true ) {
			/**
			 * Parse the next unicode codepoint.
			 *
			 * We use a custom UTF-8 decoder here. No other method
			 * is reliable and available enough to depend on it in
			 * WordPress core:
			 *
			 * * mb_ord() – is not available on all hosts.
			 * * iconv_substr() – is not available on all hosts.
			 * * preg_match() – can fail with PREG_BAD_UTF8_ERROR when the input
			 *                  contains an incomplete UTF-8 byte sequence – even
			 *                  when that sequence comes after a valid match. This
			 *                  failure mode cannot be reproduced with just any string.
			 *                  The runtime must be in a specific state. It's unclear
			 *                  how to reliably reproduce this failure mode in a
			 *                  unit test.
			 *
			 * Performance-wise, character-by-character processing via utf8_codepoint_at
			 * is still much faster than relying on preg_match(). The mbstring extension
			 * is likely faster.
			 * It would be interesting to evaluate the performance
			 * and prefer mbstring whenever it's available.
			 */
			$codepoint = utf8_codepoint_at(
				$this->xml,
				$offset + $name_byte_length,
				$bytes_parsed
			);
			if (
				// Byte sequence is not a valid UTF-8 codepoint.
				( 0xFFFD === $codepoint && 0 === $bytes_parsed ) ||
				// No codepoint at the given offset.
				null === $codepoint ||
				// The codepoint is not a valid part of an XML NameChar or NameStartChar.
				! $this->is_valid_name_codepoint( $codepoint, 0 === $name_byte_length )
			) {
				break;
			}
			$codepoint = null;
			$name_byte_length += $bytes_parsed;
		}

		return $name_byte_length;
	}

	private function is_valid_name_codepoint( $codepoint, $is_first_character = false ) {
		// Test against the NameStartChar pattern:
		// NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
		// See `https://www.w3.org/TR/xml/#NT-Name`.
		if (
			// :.
			( 0x3A <= $codepoint && $codepoint <= 0x3A ) ||
			// _.
			( 0x5F <= $codepoint && $codepoint <= 0x5F ) ||
			// A-Z.
			( 0x41 <= $codepoint && $codepoint <= 0x5A ) ||
			// a-z.
			( 0x61 <= $codepoint && $codepoint <= 0x7A ) ||
			// [#xC0-#xD6].
			( 0xC0 <= $codepoint && $codepoint <= 0xD6 ) ||
			// [#xD8-#xF6].
			( 0xD8 <= $codepoint && $codepoint <= 0xF6 ) ||
			// [#xF8-#x2FF].
			( 0xF8 <= $codepoint && $codepoint <= 0x2FF ) ||
			// [#x370-#x37D].
			( 0x370 <= $codepoint && $codepoint <= 0x37D ) ||
			// [#x37F-#x1FFF].
			( 0x37F <= $codepoint && $codepoint <= 0x1FFF ) ||
			// [#x200C-#x200D].
			( 0x200C <= $codepoint && $codepoint <= 0x200D ) ||
			// [#x2070-#x218F].
			( 0x2070 <= $codepoint && $codepoint <= 0x218F ) ||
			// [#x2C00-#x2FEF].
			( 0x2C00 <= $codepoint && $codepoint <= 0x2FEF ) ||
			// [#x3001-#xD7FF].
			( 0x3001 <= $codepoint && $codepoint <= 0xD7FF ) ||
			// [#xF900-#xFDCF].
			( 0xF900 <= $codepoint && $codepoint <= 0xFDCF ) ||
			// [#xFDF0-#xFFFD].
			( 0xFDF0 <= $codepoint && $codepoint <= 0xFFFD ) ||
			// [#x10000-#xEFFFF].
			( 0x10000 <= $codepoint && $codepoint <= 0xEFFFF )
		) {
			return true;
		}

		if ( $is_first_character ) {
			return false;
		}

		// Test against the NameChar pattern:
		// NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
		// See `https://www.w3.org/TR/xml/#NT-Name`.
		return (
			// "-".
			45 === $codepoint ||
			// ".".
			46 === $codepoint ||
			// [0-9].
			( 48 <= $codepoint && 57 >= $codepoint ) ||
			// #xB7.
			183 === $codepoint ||
			// [#x0300-#x036F].
			( 0x0300 <= $codepoint && $codepoint <= 0x036F ) ||
			// [#x203F-#x2040].
			( 0x203F <= $codepoint && $codepoint <= 0x2040 )
		);
	}

	/**
	 * Applies attribute updates and cleans up once a tag is fully parsed.
	 *
	 * @since WP_VERSION
	 */
	private function after_tag() {
		/*
		 * Purge updates if there are too many. The actual count isn't
		 * scientific, but a few values from 100 to a few thousand were
		 * tested to find a practically useful limit.
		 *
		 * If the update queue grows too big, then the Tag Processor
		 * will spend more time iterating through them and lose the
		 * efficiency gains of deferring applying them.
		 */
		if ( 1000 < count( $this->lexical_updates ) ) {
			$this->get_updated_xml();
		}

		foreach ( $this->lexical_updates as $name => $update ) {
			/*
			 * Any updates appearing after the cursor should be applied
			 * before proceeding, otherwise they may be overlooked.
			 */
			if ( $update->start >= $this->bytes_already_parsed ) {
				$this->get_updated_xml();
				break;
			}

			if ( is_int( $name ) ) {
				continue;
			}

			$this->lexical_updates[] = $update;
			unset( $this->lexical_updates[ $name ] );
		}

		$this->element = null;
		$this->token_starts_at = null;
		$this->token_length = null;
		$this->tag_name_starts_at = null;
		$this->tag_name_length = null;
		$this->text_starts_at = null;
		$this->text_length = null;
		$this->is_closing_tag = null;
		$this->pubid_literal = null;
		$this->system_literal = null;
		$this->attributes = array();
		$this->qualified_attributes = array();
	}

	/**
	 * Applies lexical updates to the XML document.
	 *
	 * @param int $shift_this_point Accumulate and return shift for this position.
	 *
	 * @return int How many bytes the given pointer moved in response to the updates.
	 * @since WP_VERSION
	 */
	private function apply_lexical_updates( $shift_this_point = 0 ) {
		if ( ! count( $this->lexical_updates ) ) {
			return 0;
		}

		$accumulated_shift_for_given_point = 0;

		/*
		 * Attribute updates can be enqueued in any order but updates
		 * to the document must occur in lexical order; that is, each
		 * replacement must be made before all others which follow it
		 * at later string indices in the input document.
		 *
		 * Sorting avoids making out-of-order replacements which
		 * can lead to mangled output, partially-duplicated
		 * attributes, and overwritten attributes.
		 */
		usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) );

		$bytes_already_copied = 0;
		$output_buffer = '';
		foreach ( $this->lexical_updates as $diff ) {
			$shift = strlen( $diff->text ) - $diff->length;

			// Adjust the cursor position by however much an update affects it.
			if ( $diff->start < $this->bytes_already_parsed ) {
				$this->bytes_already_parsed += $shift;
			}

			// Accumulate shift of the given pointer within this function call.
			if ( $diff->start <= $shift_this_point ) {
				$accumulated_shift_for_given_point += $shift;
			}

			$output_buffer .= substr( $this->xml, $bytes_already_copied, $diff->start - $bytes_already_copied );
			$output_buffer .= $diff->text;
			$bytes_already_copied = $diff->start + $diff->length;
		}

		$this->xml = $output_buffer . substr( $this->xml, $bytes_already_copied );

		/*
		 * Adjust bookmark locations to account for how the text
		 * replacements adjust offsets in the input document.
		 */
		foreach ( $this->bookmarks as $bookmark_name => $bookmark ) {
			$bookmark_end = $bookmark->start + $bookmark->length;

			/*
			 * Each lexical update which appears before the bookmark's endpoints
			 * might shift the offsets for those endpoints. Loop through each change
			 * and accumulate the total shift for each bookmark, then apply that
			 * shift after tallying the full delta.
			 */
			$head_delta = 0;
			$tail_delta = 0;

			foreach ( $this->lexical_updates as $diff ) {
				$diff_end = $diff->start + $diff->length;

				if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) {
					break;
				}

				if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) {
					$this->release_bookmark( $bookmark_name );
					continue 2;
				}

				$delta = strlen( $diff->text ) - $diff->length;

				if ( $bookmark->start >= $diff->start ) {
					$head_delta += $delta;
				}

				if ( $bookmark_end >= $diff_end ) {
					$tail_delta += $delta;
				}
			}

			$bookmark->start += $head_delta;
			$bookmark->length += $tail_delta - $head_delta;
		}

		$this->lexical_updates = array();

		return $accumulated_shift_for_given_point;
	}

	/**
	 * Checks whether a bookmark with the given name exists.
	 *
	 * @param string $bookmark_name Name to identify a bookmark that potentially exists.
	 *
	 * @return bool Whether that bookmark exists.
	 * @since WP_VERSION
	 */
	public function has_bookmark( $bookmark_name ) {
		return array_key_exists( $bookmark_name, $this->bookmarks );
	}

	/**
	 * Move the internal cursor in the Tag Processor to a given bookmark's location.
	 *
	 * Be careful!
	 * Seeking backwards to a previous location resets the parser to the
	 * start of the document and reparses the entire contents up until it finds the
	 * sought-after bookmarked location.
	 *
	 * In order to prevent accidental infinite loops, there's a
	 * maximum limit on the number of times seek() can be called.
	 *
	 * @param string $bookmark_name Jump to the place in the document identified by this bookmark name.
	 *
	 * @return bool Whether the internal cursor was successfully moved to the bookmark's location.
	 * @since WP_VERSION
	 */
	public function seek( $bookmark_name ) {
		if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Unknown bookmark name.' ),
				'WP_VERSION'
			);

			return false;
		}

		if ( ++$this->seek_count > static::MAX_SEEK_OPS ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Too many calls to seek() - this can lead to performance issues.' ),
				'WP_VERSION'
			);

			return false;
		}

		// Flush out any pending updates to the document.
		$this->get_updated_xml();

		// Point this tag processor before the sought tag opener and consume it.
		$this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
		$this->parser_state = self::STATE_READY;

		return $this->parse_next_token();
	}

	/**
	 * Compare two WP_HTML_Text_Replacement objects.
	 *
	 * @param WP_HTML_Text_Replacement $a First attribute update.
	 * @param WP_HTML_Text_Replacement $b Second attribute update.
	 *
	 * @return int Comparison value for string order.
	 * @since WP_VERSION
	 */
	private static function sort_start_ascending( $a, $b ) {
		$by_start = $a->start - $b->start;
		if ( 0 !== $by_start ) {
			return $by_start;
		}

		$by_text = isset( $a->text, $b->text ) ? strcmp( $a->text, $b->text ) : 0;
		if ( 0 !== $by_text ) {
			return $by_text;
		}

		/*
		 * This code should be unreachable, because it implies the two replacements
		 * start at the same location and contain the same text.
		 */

		return $a->length - $b->length;
	}

	/**
	 * Return the enqueued value for a given attribute, if one exists.
	 *
	 * Enqueued updates can take different data types:
	 *  - If an update is enqueued and is boolean, the return will be `true`
	 *  - If an update is otherwise enqueued, the return will be the string value of that update.
	 *  - If an attribute is enqueued to be removed, the return will be `null` to indicate that.
	 *  - If no updates are enqueued, the return will be `false` to differentiate from "removed."
	 *
	 * @param string $comparable_name The attribute name in its comparable form.
	 *
	 * @return string|boolean|null Value of enqueued update if present, otherwise false.
	 * @since WP_VERSION
	 */
	private function get_enqueued_attribute_value( $comparable_name ) {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return false;
		}

		if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) {
			return false;
		}

		$enqueued_text = $this->lexical_updates[ $comparable_name ]->text;

		// Removed attributes erase the entire span.
		if ( '' === $enqueued_text ) {
			return null;
		}

		/*
		 * Boolean attribute updates are just the attribute name without a corresponding value.
		 *
		 * This value might differ from the given comparable name in that there could be leading
		 * or trailing whitespace, and that the casing follows the name given in `set_attribute`.
		 *
		 * Example:
		 *
		 *     $p->set_attribute( 'data-TEST-id', 'update' );
		 *     'update' === $p->get_enqueued_attribute_value( 'data-test-id' );
		 *
		 * Detect this difference based on the absence of the `=`, which _must_ exist in any
		 * attribute containing a value, e.g. `<input type="text" enabled />`.
		 *                                            ¹           ²
		 *                                       1. Attribute with a string value.
		 *                                       2. Boolean attribute whose value is `true`.
		 */
		$equals_at = strpos( $enqueued_text, '=' );
		if ( false === $equals_at ) {
			return true;
		}

		/*
		 * Finally, a normal update's value will appear after the `=` and
		 * be double-quoted, as performed incidentally by `set_attribute`.
		 *
		 * e.g. `type="text"`
		 *           ¹²    ³
		 *        1. Equals is here.
		 *        2. Double-quoting starts one after the equals sign.
		 *        3. Double-quoting ends at the last character in the update.
		 */
		$enqueued_value = substr( $enqueued_text, $equals_at + 2, - 1 );

		/*
		 * We're deliberately not decoding entities in attribute values:
		 *
		 *     Attribute values must not contain direct or indirect entity references to external entities.
		 *
		 * See https://www.w3.org/TR/xml/#sec-starttags.
		 */

		return $enqueued_value;
	}

	/**
	 * Returns the value of a requested attribute from a matched tag opener if that attribute exists.
	 *
	 * Example:
	 *
	 *     $p = new XMLProcessor(
	 *         '<root xmlns:wp="http://www.w3.org/1999/xhtml">
	 *             <content enabled="true" class="test" data-test-id="14" wp:enabled="true">Test</content>
	 *         </root>'
	 *     );
	 *     $p->next_tag( array( 'class_name' => 'test' ) ) === true;
	 *     $p->get_attribute( '', 'data-test-id' ) === '14';
	 *     $p->get_attribute( '', 'enabled' ) === "true";
	 *     $p->get_attribute( 'wp', 'enabled' ) === null;
	 *     $p->get_attribute( 'http://www.w3.org/1999/xhtml', 'enabled' ) === "true";
	 *     $p->get_attribute( '', 'aria-label' ) === null;
	 *
	 * @param string $namespace_reference Full namespace of the requested attribute, e.g. "http://wordpress.org/export/1.2/"
	 * @param string $local_name Name of attribute whose value is requested, e.g. data-test-id
	 *
	 * @return string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`.
	 * @since WP_VERSION
	 */
	public function get_attribute( $namespace_reference, $local_name ) {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state && self::STATE_XML_DECLARATION !== $this->parser_state ) {
			return null;
		}

		$full_name = $namespace_reference ? '{' . $namespace_reference . '}' . $local_name : $local_name;

		// Return any enqueued attribute value updates if they exist.
		$enqueued_value = $this->get_enqueued_attribute_value( $full_name );
		if ( false !== $enqueued_value ) {
			return $enqueued_value;
		}

		if ( !
			isset( $this->attributes[ $full_name ] ) ) {
			return null;
		}

		$attribute = $this->attributes[ $full_name ];
		$raw_value = substr( $this->xml, $attribute->value_starts_at, $attribute->value_length );

		$decoded = XMLDecoder::decode( $raw_value );
		if ( ! isset( $decoded ) ) {
			/**
			 * If the attribute contained an invalid value, it's
			 * a fatal error.
			 *
			 * @see XMLDecoder::decode()
			 */
			$this->last_error = self::ERROR_SYNTAX;
			_doing_it_wrong(
				__METHOD__,
				__( 'Invalid attribute value encountered.' ),
				'WP_VERSION'
			);

			return false;
		}

		return $decoded;
	}

	/**
	 * Gets a value from a qualified attribute name if it exists in the
	 * matched tag.
	 *
	 * It's for internal use only to source values of raw attributes
	 * after they're parsed but before the namespaces are resolved.
	 *
	 * @param string $qname The qualified attribute name.
	 * @return string|null The attribute value, or null if not found.
	 */
	private function get_qualified_attribute( $qname ) {
		if ( ! isset( $this->qualified_attributes[ $qname ] ) ) {
			return null;
		}

		$attribute = $this->qualified_attributes[ $qname ];
		$raw_value = substr( $this->xml, $attribute->value_starts_at, $attribute->value_length );

		$decoded = XMLDecoder::decode( $raw_value );
		if ( ! isset( $decoded ) ) {
			/**
			 * If the attribute contained an invalid value, it's
			 * a fatal error.
			 *
			 * @see XMLDecoder::decode()
			 */
			$this->last_error = self::ERROR_SYNTAX;
			_doing_it_wrong(
				__METHOD__,
				__( 'Invalid attribute value encountered.' ),
				'WP_VERSION'
			);

			return false;
		}

		return $decoded;
	}

	/**
	 * Gets names of all attributes matching a given namespace prefix and local name prefix in the current tag.
	 *
	 * This method allows you to filter attributes by both their namespace prefix (e.g., 'wp') and the start of their local name (e.g., 'data-').
	 * Matching is case-sensitive for both namespace and local name prefixes, in accordance with the XML specification.
	 *
	 * Each returned attribute is represented as a two-element array: [namespace_prefix, local_name].
	 *
	 * Examples:
	 *
	 *     // No namespace, local name prefix only. Matching is case-sensitive,
	 *     // so `DATA-test-id` does not match the `data-` prefix.
	 *     $p = new XMLProcessor( '<content data-ENABLED="1" class="test" DATA-test-id="14">Test</content>' );
	 *     $p->next_tag( array( 'class_name' => 'test' ) ) === true;
	 *     $p->get_attribute_names_with_prefix( '', 'data-' );
	 *     // Returns: array( array( '', 'data-ENABLED' ) )
	 *
	 *     // With namespace prefix
	 *     $p = new XMLProcessor( '<content xmlns:wp="http://wordpress.org/export/1.2/" wp:data-foo="bar" wp:data-bar="baz" data-no-namespace="true" />' );
	 *     $p->next_tag();
	 *     $p->get_attribute_names_with_prefix( 'http://wordpress', 'data-' );
	 *     // Returns: array( array( 'http://wordpress.org/export/1.2/', 'data-foo' ), array( 'http://wordpress.org/export/1.2/', 'data-bar' ) )
	 *
	 *     // Empty string namespace prefix matches all attributes.
	 *     $p->get_attribute_names_with_prefix( '', 'data-' );
	 *     // Returns: array( array( 'http://wordpress.org/export/1.2/', 'data-foo' ), array( 'http://wordpress.org/export/1.2/', 'data-bar' ), array( '', 'data-no-namespace' ) )
	 *
	 *     // Null namespace prefix matches attributes with no namespace.
	 *     $p->get_attribute_names_with_prefix( null, 'data-' );
	 *     // Returns: array( array( '', 'data-no-namespace' ) )
	 *
	 *     // No match for wrong namespace prefix
	 *     $p->get_attribute_names_with_prefix( 'other', 'data-' );
	 *     // Returns: array()
	 *
	 * @param string $full_namespace_prefix Prefix of the fully qualified namespace to match on (e.g., 'http://wordpress.org/'). Use '' for no namespace prefix.
	 * @param string $local_name_prefix Local name prefix to match (e.g., 'data-').
	 *
	 * @return array|null List of [namespace, local_name] pairs, or `null` when no tag opener is matched.
	 * @since WP_VERSION
	 */
	public function get_attribute_names_with_prefix( $full_namespace_prefix, $local_name_prefix ) {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return null;
		}

		$matches = array();
		foreach ( $this->attributes as $attr ) {
			if (
				0 === strncmp( $attr->local_name, $local_name_prefix, strlen( $local_name_prefix ) ) &&
				(
					// Distinguish between no namespace and empty namespace.
					( null === $full_namespace_prefix && '' === $attr->namespace ) ||
					( null !== $full_namespace_prefix && 0 === strncmp( $attr->namespace, $full_namespace_prefix, strlen( $full_namespace_prefix ) ) )
				)
			) {
				$matches[] = array( $attr->namespace, $attr->local_name );
			}
		}

		return $matches;
	}

	/**
	 * Returns the local name of the matched tag.
	 *
	 * Example without namespaces:
	 *
	 *     $p = new XMLProcessor( '<content class="test">Test</content>' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag_local_name() === 'content';
	 *
	 * Example with namespaces:
	 *
	 *     $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag_local_name() === 'content';
	 *
	 * @return string|null Name of currently matched tag in input XML, or `null` if none found.
	 * @since WP_VERSION
	 */
	public function get_tag_local_name() {
		if ( null !== $this->element ) {
			// Return cached name if we already have it.
			return $this->element->local_name;
		}

		$qualified_tag_name = $this->get_tag_name_qualified();
		if ( null === $qualified_tag_name ) {
			return null;
		}

		list( $_, $local_name ) = $this->parse_qualified_name( $qualified_tag_name );

		return $local_name;
	}

	/**
	 * Returns the namespace prefix and the local name of the matched tag.
	 *
	 * Example without namespaces:
	 *
	 *     $p = new XMLProcessor( '<content>Test</content>' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag_name_qualified() === 'content';
	 *
	 * Example with namespaces:
	 *
	 *     $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag_name_qualified() === 'wp:content';
	 *
	 * @return string|null The namespace prefix and the local name of the matched tag, or null if not available.
	 */
	private function get_tag_name_qualified() {
		if ( null !== $this->element ) {
			// Return cached name if we already have it.
			return $this->element->qualified_name;
		}

		if ( null === $this->tag_name_starts_at ) {
			return null;
		}

		$tag_name = substr( $this->xml, $this->tag_name_starts_at, $this->tag_name_length );

		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return null;
		}

		return $tag_name;
	}

	/**
	 * Returns a string with the fully qualified namespace and local name of the matched tag
	 * in the following format: "{namespace}local_name".
	 *
	 * Example:
	 *
	 *     $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag_namespace_and_local_name() === '{http://www.w3.org/1999/xhtml}content';
	 *
	 * @return string|null The namespace and local name of the matched tag, or null if not available.
	 */
	public function get_tag_namespace_and_local_name() {
		$namespace = $this->get_tag_namespace();
		if ( ! $namespace ) {
			return $this->get_tag_local_name();
		}

		return '{' . $namespace . '}' . $this->get_tag_local_name();
	}

	/**
	 * Returns the namespace reference of the matched tag.
	 *
	 * Example:
	 *
	 *     $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
	 *     $p->next_tag() === true;
	 *     $p->next_tag() === true;
	 *     $p->get_tag_namespace() === 'http://www.w3.org/1999/xhtml';
	 *
	 * @return string|null The namespace reference of the matched tag, or null if not available.
	 */
	public function get_tag_namespace() {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return null;
		}

		return $this->element->namespace;
	}

	/**
	 * Returns the name from the DOCTYPE declaration.
	 *
	 * ```
	 * doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
	 *                               ^^^^
	 * ```
	 *
	 * @return string|null The name from the DOCTYPE declaration, or null if not available.
	 * @since WP_VERSION
	 */
	public function get_doctype_name() {
		if ( null === $this->doctype_name ) {
			return null;
		}

		return substr( $this->xml, $this->doctype_name->start, $this->doctype_name->length );
	}

	/**
	 * Returns the system literal value from the DOCTYPE declaration.
	 *
	 * Example:
	 *
	 *     <!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	 *
	 * In this example, the system_literal would be:
	 * "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
	 *
	 * @return string|null The system literal value, or null if not available.
	 * @since WP_VERSION
	 */
	public function get_system_literal() {
		if ( null === $this->system_literal ) {
			return null;
		}

		return substr( $this->xml, $this->system_literal->start, $this->system_literal->length );
	}

	/**
	 * Returns the public identifier value from the DOCTYPE declaration.
	 *
	 * Example:
	 *
	 *     <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	 *
	 * In this example, the pubid_literal would be:
	 * "-//W3C//DTD XHTML 1.0 Strict//EN"
	 *
	 * @return string|null The public identifier value, or null if not available.
	 * @since WP_VERSION
	 */
	public function get_pubid_literal() {
		if ( null === $this->pubid_literal ) {
			return null;
		}

		return substr( $this->xml, $this->pubid_literal->start, $this->pubid_literal->length );
	}

	/**
	 * Indicates if the currently matched tag is expected to be closed.
	 * Returns true for tag openers (<div>) and false for empty elements (<img />) and tag closers (</div>).
	 *
	 * This method exists to provide a consistent interface with WP_HTML_Processor.
	 *
	 * @return bool Whether the tag is expected to be closed.
	 */
	public function expects_closer() {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return false;
		}

		return $this->is_tag_opener() && ! $this->is_empty_element();
	}

	/**
	 * Indicates if the currently matched tag is an empty element tag.
	 *
	 * XML tags ending with a solidus ("/") are parsed as empty elements. They have no
	 * content and no matching closer is expected.
	 *
	 * @return bool Whether the currently matched tag is an empty element tag.
	 * @since WP_VERSION
	 */
	public function is_empty_element() {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return false;
		}

		/*
		 * An empty element tag is defined by the solidus at the _end_ of the tag, not the beginning.
		 *
		 * Example:
		 *
		 *     <figure />
		 *             ^ this appears one character before the end of the closing ">".
		 */

		return '/' === $this->xml[ $this->token_starts_at + $this->token_length - 2 ];
	}

	/**
	 * Indicates if the current tag token is a tag closer.
	 *
	 * Example:
	 *
	 *     $p = new XMLProcessor( '<content></content>' );
	 *     $p->next_tag( array( 'tag_name' => 'content', 'tag_closers' => 'visit' ) );
	 *     $p->is_tag_closer() === false;
	 *
	 *     $p->next_tag( array( 'tag_name' => 'content', 'tag_closers' => 'visit' ) );
	 *     $p->is_tag_closer() === true;
	 *
	 * @return bool Whether the current tag is a tag closer.
	 * @since WP_VERSION
	 */
	public function is_tag_closer() {
		return (
			self::STATE_MATCHED_TAG === $this->parser_state &&
			$this->is_closing_tag
		);
	}

	/**
	 * Indicates if the current tag token is a tag opener.
	 *
	 * Example:
	 *
	 *     $p = new XMLProcessor( '<content></content>' );
	 *     $p->next_token();
	 *     $p->is_tag_opener() === true;
	 *
	 *     $p->next_token();
	 *     $p->is_tag_opener() === false;
	 *
	 * @return bool Whether the current tag is a tag opener.
	 * @since WP_VERSION
	 */
	public function is_tag_opener() {
		return (
			self::STATE_MATCHED_TAG === $this->parser_state &&
			! $this->is_closing_tag &&
			! $this->is_empty_element()
		);
	}

	/**
	 * Indicates the kind of matched token, if any.
	 *
	 * This differs from `get_token_name()` in that it always
	 * returns a static string indicating the type, whereas
	 * `get_token_name()` may return values derived from the
	 * token itself, such as a tag name or processing
	 * instruction target.
	 *
	 * Possible values:
	 *  - `#tag` when matched on a tag.
	 *  - `#text` when matched on a text node.
	 *  - `#cdata-section` when matched on a CDATA node.
	 *  - `#comment` when matched on a comment.
	 *  - `#presumptuous-tag` when matched on an empty tag closer.
	 *
	 * @return string|null What kind of token is matched, or null.
	 * @since WP_VERSION
	 */
	public function get_token_type() {
		switch ( $this->parser_state ) {
			case self::STATE_MATCHED_TAG:
				return '#tag';

			default:
				return $this->get_token_name();
		}
	}

	/**
	 * Returns the node name represented by the token.
	 *
	 * This matches the DOM API value `nodeName`. Some values
	 * are static, such as `#text` for a text node, while others
	 * are dynamically generated from the token itself.
	 *
	 * Dynamic names:
	 *  - Local tag name for tag matches, with its casing preserved.
	 *
	 * Note that if the Tag Processor is not matched on a token
	 * then this function will return `null`, either because it
	 * hasn't yet found a token or because it reached the end
	 * of the document without matching a token.
	 *
	 * @return string|null Name of the matched token.
	 * @since WP_VERSION
	 */
	public function get_token_name() {
		switch ( $this->parser_state ) {
			case self::STATE_MATCHED_TAG:
				return $this->get_tag_local_name();

			case self::STATE_TEXT_NODE:
				return '#text';

			case self::STATE_CDATA_NODE:
				return '#cdata-section';

			case self::STATE_DOCTYPE_NODE:
				return '#doctype';

			case self::STATE_XML_DECLARATION:
				return '#xml-declaration';

			case self::STATE_PI_NODE:
				return '#processing-instructions';

			case self::STATE_COMMENT:
				return '#comment';

			case self::STATE_COMPLETE:
				return '#complete';

			case self::STATE_INVALID_DOCUMENT:
				return '#error';

			default:
				return '#none';
		}
	}

	/**
	 * Returns the modifiable text for a matched token, or an empty string.
	 *
	 * Modifiable text is text content that may be read and changed without
	 * changing the XML structure of the document around it. This includes
	 * the contents of `#text` and `#cdata-section` nodes in the XML as well
	 * as the inner contents of XML comments, Processing Instructions, and others.
	 *
	 * If a token has no modifiable text then an empty string is returned to
	 * avoid needless crashing or type errors. An empty string does not mean
	 * that a token has modifiable text, and a token with modifiable text may
	 * have an empty string (e.g. a comment with no contents).
	 *
	 * @return string
	 * @since WP_VERSION
	 */
	public function get_modifiable_text() {
		if ( null === $this->text_starts_at ) {
			return '';
		}

		$text = substr( $this->xml, $this->text_starts_at, $this->text_length );

		/*
		 * > the XML processor must behave as if it normalized all line breaks in external parsed
		 * > entities (including the document entity) on input, before parsing, by translating both
		 * > the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single
		 * > #xA character.
		 *
		 * See https://www.w3.org/TR/xml/#sec-line-ends
		 */
		$text = str_replace( array( "\r\n", "\r" ), "\n", $text );

		// Comment data and CDATA sections contents are not decoded any further.
		if (
			self::STATE_CDATA_NODE === $this->parser_state ||
			self::STATE_COMMENT === $this->parser_state
		) {
			return $text;
		}

		$decoded = XMLDecoder::decode( $text );
		if ( ! isset( $decoded ) ) {
			/**
			 * If the text contained an invalid value, it's
			 * a fatal error.
			 *
			 * @see XMLDecoder::decode()
			 */
			$this->last_error = self::ERROR_SYNTAX;
			_doing_it_wrong(
				__METHOD__,
				__( 'Invalid text content encountered.' ),
				'WP_VERSION'
			);

			return false;
		}

		return $decoded;
	}

	/**
	 * Updates the modifiable text for a matched token.
	 *
	 * Modifiable text is text content that may be read and changed without
	 * changing the XML structure of the document around it.
This includes * the contents of `#text` and `#cdata-section` nodes in the XML as well * as the inner contents of XML comments, Processing Instructions, and others. * * Only `#text`, `#comment`, and `#cdata-section` tokens can be updated. For any * other token false is returned to avoid needless crashing or type errors. * * @param string $new_value New modifiable text for the current node. * @return bool Whether the modifiable text was updated. */ public function set_modifiable_text( $new_value ) { switch ( $this->parser_state ) { case self::STATE_TEXT_NODE: case self::STATE_COMMENT: $this->lexical_updates[] = new WP_HTML_Text_Replacement( $this->text_starts_at, $this->text_length, // @TODO: Audit this in detail. Is this too naive? Or is it actually safe? htmlspecialchars( $new_value, ENT_XML1, 'UTF-8' ) ); return true; case self::STATE_CDATA_NODE: $this->lexical_updates[] = new WP_HTML_Text_Replacement( $this->text_starts_at, $this->text_length, // @TODO: Audit this in detail. Is this too naive? Or is it actually safe? str_replace( ']]>', ']]&gt;', $new_value ) ); return true; default: _doing_it_wrong( __METHOD__, __( 'Cannot set text content on a non-text node.' ), 'WP_VERSION' ); return false; } } /** * Updates or creates a new attribute on the currently matched tag with the passed value. * * Only string values are accepted. Passing any other type triggers * `_doing_it_wrong()` and returns false. * * The value is escaped with `htmlspecialchars()` using the `ENT_XML1` flag. * * @param string $xml_namespace The attribute's namespace. * @param string $local_name The attribute name to target. * @param string $value The new attribute value. * * @return bool Whether an attribute value was set. * @since WP_VERSION */ public function set_attribute( $xml_namespace, $local_name, $value ) { if ( ! is_string( $value ) ) { _doing_it_wrong( __METHOD__, __( 'Non-string attribute values cannot be passed to set_attribute().'
), 'WP_VERSION' ); return false; } if ( 'xmlns' === $xml_namespace ) { $this->bail( __( 'Setting attributes in the xmlns namespace is not yet supported by set_attribute().' ) ); return false; } if ( self::STATE_MATCHED_TAG !== $this->parser_state || $this->is_closing_tag ) { return false; } $value = htmlspecialchars( $value, ENT_XML1, 'UTF-8' ); if ( '' !== $xml_namespace ) { $prefix = $this->get_tag_namespace_prefix( $xml_namespace ); if ( false === $prefix ) { $this->bail( sprintf( // translators: %1$s is the XML namespace. __( 'The namespace "%1$s" is not in the current element\'s scope.' ), $xml_namespace ) ); return false; } $name = $prefix . ':' . $local_name; } else { $name = $local_name; } $updated_attribute = "{$name}=\"{$value}\""; /* * > An attribute name must not appear more than once * > in the same start-tag or empty-element tag. * - XML 1.0 spec * * @see https://www.w3.org/TR/xml/#sec-starttags */ if ( isset( $this->attributes[ $name ] ) ) { /* * Update an existing attribute. * * Example – set attribute id to "new" in <content id="initial_id"/>: * * <content id="initial_id"/> * ^-------------^ * start end * replacement: `id="new"` * * Result: <content id="new"/> */ $existing_attribute = $this->attributes[ $name ]; $this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement( $existing_attribute->start, $existing_attribute->length, $updated_attribute ); } else { /* * Create a new attribute at the tag's name end. * * Example – add attribute id="new" to <content />: * * <content /> * ^ * start and end * replacement: ` id="new"` * * Result: <content id="new" /> */ $this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement( $this->tag_name_starts_at + $this->tag_name_length, 0, ' ' . $updated_attribute ); } return true; } /** * Remove an attribute from the currently-matched tag. * * @param string $xml_namespace The attribute's namespace. * @param string $local_name The attribute name to remove. * * @return bool Whether an attribute was removed.
* @since WP_VERSION */ public function remove_attribute( $xml_namespace, $local_name ) { if ( self::STATE_MATCHED_TAG !== $this->parser_state || $this->is_closing_tag ) { return false; } $name = $xml_namespace ? '{' . $xml_namespace . '}' . $local_name : $local_name; /* * If updating an attribute that didn't exist in the input * document, then remove the enqueued update and move on. * * For example, this might occur when calling `remove_attribute()` * after calling `set_attribute()` for the same attribute * and when that attribute wasn't originally present. */ if ( ! isset( $this->attributes[ $name ] ) ) { if ( isset( $this->lexical_updates[ $name ] ) ) { unset( $this->lexical_updates[ $name ] ); } return false; } /* * Removes an existing tag attribute. * * Example – remove the attribute id from <content id="initial_id"/>: * <content id="initial_id"/> * ^-------------^ * start end * replacement: `` * * Result: <content /> */ $this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement( $this->attributes[ $name ]->start, $this->attributes[ $name ]->length, '' ); return true; } /** * Returns the string representation of the XML Tag Processor. * * @return string The processed XML. * @see XMLProcessor::get_updated_xml() * * @since WP_VERSION */ public function __toString() { return $this->get_updated_xml(); } /** * Returns the XML document with all enqueued updates applied. * * @return string The processed XML. * @since WP_VERSION */ public function get_updated_xml() { $requires_no_updating = 0 === count( $this->lexical_updates ); /* * When there is nothing more to update and nothing has already been * updated, return the original document and avoid a string copy. */ if ( $requires_no_updating ) { return $this->xml; } /* * Keep track of the position right before the current token. This will * be necessary for reparsing the current token after updating the XML. */ $before_current_token = isset( $this->token_starts_at ) ? $this->token_starts_at : 0; /* * 1.
Apply the enqueued edits and update all the pointers to reflect those changes. */ $before_current_token += $this->apply_lexical_updates( $before_current_token ); /* * 2. Rewind to before the current tag and reparse to get updated attributes. * * At this point the internal cursor points to the end of the tag name. * Rewind before the tag name starts so that it's as if the cursor didn't * move; a call to `next_tag()` will reparse the recently-updated attributes * and additional calls to modify the attributes will apply at this same * location, but in order to avoid issues with subclasses that might add * behaviors to `next_tag()`, the internal methods should be called here * instead. * * It's important to note that in this specific place there will be no change * because the processor was already at a tag when this was called and it's * rewinding only to the beginning of this very tag before reprocessing it * and its attributes. * * <p>Previous XML<em>More XML</em></p> * ↑ │ back up by the length of the tag name plus the opening < * └←─┘ back up by strlen("em") + 1 ==> 3 */ $this->bytes_already_parsed = $before_current_token; $this->parse_next_token(); return $this->xml; } /** * Finds the next token in the XML document. * * An XML document can be viewed as a stream of tokens, * where tokens are things like XML tags, XML comments, * text nodes, etc. This method finds the next token in * the XML document and returns whether it found one. * * If it starts parsing a token and reaches the end of the * document then it will seek to the start of the last * token and pause, returning `false` to indicate that it * failed to find a complete token. * * Possible token types, based on the XML specification: * * - an XML tag * - a text node - the plaintext inside tags. * - a CData section * - an XML comment. * - a DOCTYPE declaration. * - a processing instruction, e.g. `<?xml mode="WordPress" ?>`. * * @return bool Whether a token was parsed. 
*/ public function next_token() { return $this->step(); } /** * Moves the internal cursor to the next token in the XML document * according to the XML specification. * * It considers the current XML context (prolog, element, or misc) * and only expects the nodes that are allowed in that context. * * @param int $node_to_process Whether to process the next node or * reprocess the current node, e.g. using another parser context. * * @return bool Whether a token was parsed. * @since WP_VERSION * * @access private */ private function step( $node_to_process = self::PROCESS_NEXT_NODE ) { // Refuse to proceed if there was a previous error. if ( null !== $this->last_error ) { return false; } // Finish stepping when there are no more tokens in the document. if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state || self::STATE_COMPLETE === $this->parser_state ) { return false; } if ( self::PROCESS_NEXT_NODE === $node_to_process ) { if ( $this->is_empty_element() ) { array_pop( $this->stack_of_open_elements ); } } try { switch ( $this->parser_context ) { case self::IN_PROLOG_CONTEXT: return $this->step_in_prolog( $node_to_process ); case self::IN_ELEMENT_CONTEXT: return $this->step_in_element( $node_to_process ); case self::IN_MISC_CONTEXT: return $this->step_in_misc( $node_to_process ); default: $this->last_error = self::ERROR_UNSUPPORTED; return false; } } catch ( XMLUnsupportedException $e ) { /* * Exceptions are used in this class to escape deep call stacks that * otherwise might involve messier calling and return conventions. */ return false; } } /** * Parses the next node in the 'prolog' part of the XML document. * * @return bool Whether a node was found. * @see https://www.w3.org/TR/xml/#NT-document. * @see XMLProcessor::step * * @since WP_VERSION */ private function step_in_prolog( $node_to_process = self::PROCESS_NEXT_NODE ) { if ( self::PROCESS_NEXT_NODE === $node_to_process ) { $has_next_node = $this->parse_next_token(); if ( false === $has_next_node && ! 
$this->expecting_more_input ) { $this->bail( 'The root element was not found.', self::ERROR_SYNTAX ); } } // XML requires a root element. If we've reached the end of data in the prolog stage, // before finding a root element, then the document is incomplete. if ( self::STATE_COMPLETE === $this->parser_state ) { $this->mark_incomplete_input(); return false; } // Do not step if we paused due to an incomplete input. if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } switch ( $this->get_token_type() ) { case '#text': $text = $this->get_modifiable_text(); $whitespaces = strspn( $text, " \t\n\r" ); if ( strlen( $text ) !== $whitespaces ) { // @TODO: Only look for this in the 2 initial bytes of the document:. if ( "\xFF\xFE" === substr( $text, 0, 2 ) ) { $this->bail( 'Unexpected UTF-16 BOM byte sequence (0xFFFE) in the document. XMLProcessor only supports UTF-8.', self::ERROR_SYNTAX ); } $this->bail( 'Unexpected non-whitespace text token in prolog stage.', self::ERROR_SYNTAX ); } return $this->step(); // @TODO: Fail if there's more than one <!DOCTYPE> or if <!DOCTYPE> was found before the XML declaration token. case '#doctype': case '#comment': case '#xml-declaration': case '#processing-instructions': return true; case '#tag': $this->parser_context = self::IN_ELEMENT_CONTEXT; return $this->step( self::PROCESS_CURRENT_NODE ); default: $this->bail( 'Unexpected token type in prolog stage.', self::ERROR_SYNTAX ); } } /** * Parses the next node in the 'element' part of the XML document. * * @return bool Whether a node was found. * @see https://www.w3.org/TR/xml/#NT-document. * @see XMLProcessor::step * * @since WP_VERSION */ private function step_in_element( $node_to_process = self::PROCESS_NEXT_NODE ) { if ( self::PROCESS_NEXT_NODE === $node_to_process ) { $has_next_node = $this->parse_next_token(); if ( false === $has_next_node && ! 
$this->expecting_more_input ) { $this->bail( 'A tag was not closed.', self::ERROR_SYNTAX ); } } // Do not step if we paused due to an incomplete input. if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } switch ( $this->get_token_type() ) { case '#text': case '#cdata-section': case '#comment': case '#processing-instructions': return true; case '#tag': // Update the stack of open elements. $tag_qname = $this->get_tag_name_qualified(); if ( $this->is_tag_closer() ) { if ( ! count( $this->stack_of_open_elements ) ) { $this->bail( sprintf( // translators: %1$s is the name of the closing XML tag. __( 'The closing tag "%1$s" did not match any open tag.' ), $tag_qname ), self::ERROR_SYNTAX ); return false; } $this->element = array_pop( $this->stack_of_open_elements ); $popped_qname = $this->element->qualified_name; if ( $popped_qname !== $tag_qname ) { $this->bail( sprintf( // translators: %1$s is the name of the closing XML tag, %2$s is the name of the opening XML tag. __( 'The closing tag "%1$s" did not match the opening tag "%2$s".' ), $tag_qname, $popped_qname ), self::ERROR_SYNTAX ); } if ( 0 === count( $this->stack_of_open_elements ) ) { $this->parser_context = self::IN_MISC_CONTEXT; } } else { array_push( $this->stack_of_open_elements, $this->element ); $this->element = $this->top_element(); } return true; default: $this->bail( sprintf( // translators: %1$s is the unexpected token type. __( 'Unexpected token type "%1$s" in element stage.', 'data-liberation' ), $this->get_token_type() ), self::ERROR_SYNTAX ); } } /** * Parses the next node in the 'misc' part of the XML document. * * @return bool Whether a node was found. * @see https://www.w3.org/TR/xml/#NT-document. * @see XMLProcessor::step * * @since WP_VERSION */ private function step_in_misc( $node_to_process = self::PROCESS_NEXT_NODE ) { if ( self::PROCESS_NEXT_NODE === $node_to_process ) { $has_next_node = $this->parse_next_token(); if ( false === $has_next_node && !
$this->expecting_more_input ) { // Parsing is complete. $this->parser_state = self::STATE_COMPLETE; return true; } } // Do not step if we paused due to an incomplete input. if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { return false; } if ( self::STATE_COMPLETE === $this->parser_state ) { return true; } switch ( $this->get_token_type() ) { case '#comment': case '#processing-instructions': return true; case '#text': $text = $this->get_modifiable_text(); $whitespaces = strspn( $text, " \t\n\r" ); if ( strlen( $text ) !== $whitespaces ) { $this->bail( 'Unexpected token type "' . $this->get_token_type() . '" in misc stage.', self::ERROR_SYNTAX ); } return $this->step(); default: $this->bail( 'Unexpected token type "' . $this->get_token_type() . '" in misc stage.', self::ERROR_SYNTAX ); } } /** * Computes the XML breadcrumbs for the currently-matched element, if matched. * * Breadcrumbs start at the outermost parent and descend toward the matched element. * They always include the entire path from the root XML node to the matched element. * Example * * $processor = XMLProcessor::create_fragment( '<p><strong><em><img/></em></strong></p>' ); * $processor->next_tag( 'img' ); * $processor->get_breadcrumbs() === array( 'p', 'strong', 'em', 'img' ); * * @return string[]|null Array of tag names representing path to matched node, if matched, otherwise NULL. * @since WP_VERSION */ public function get_breadcrumbs() { return array_map( function ( $element ) { return array( $element->namespace, $element->local_name ); }, $this->stack_of_open_elements ); } /** * Indicates if the currently-matched tag matches the given breadcrumbs. * * A "*" represents a single tag wildcard, where any tag matches, but not no tags. * * At some point this function _may_ support a `**` syntax for matching any number * of unspecified tags in the breadcrumb stack. 
This has been intentionally left * out, however, to keep this function simple and to avoid introducing backtracking, * which could cause surprising performance breakdowns. * * Example: * * $processor = new XMLProcessor( '<root><post><content><image /></content></post></root>' ); * $processor->next_tag( 'image' ); * true === $processor->matches_breadcrumbs( array( 'content', 'image' ) ); * true === $processor->matches_breadcrumbs( array( 'post', 'content', 'image' ) ); * false === $processor->matches_breadcrumbs( array( 'post', 'image' ) ); * true === $processor->matches_breadcrumbs( array( 'post', '*', 'image' ) ); * * @param array $breadcrumbs DOM sub-path at which element is found, e.g. `array( 'content', 'image' )`. * Each crumb is either a local name or an `array( namespace, local_name )` pair. * May also contain the wildcard `*` which matches a single element, e.g. `array( 'post', '*' )`. * * @return bool Whether the currently-matched tag is found at the given nested structure. * @since WP_VERSION */ public function matches_breadcrumbs( $breadcrumbs ) { // Everything matches when there are zero constraints. if ( 0 === count( $breadcrumbs ) ) { return true; } if ( '#tag' !== $this->get_token_type() ) { return false; } $open_elements = $this->stack_of_open_elements; $crumb_count = count( $breadcrumbs ); $elem_count = count( $open_elements ); // Walk backwards through both arrays, matching each crumb to the corresponding open element. for ( $j = 1; $j <= $crumb_count; $j++ ) { $crumb = $breadcrumbs[ $crumb_count - $j ]; $element = isset( $open_elements[ $elem_count - $j ] ) ? $open_elements[ $elem_count - $j ] : null; if ( ! $element ) { return false; } // Normalize crumb to [namespace, local_name]. if ( ! is_array( $crumb ) ) { if ( '*' === $crumb ) { $crumb = array( '*', '*' ); } else { $crumb = array( '*', $crumb ); } } list( $namespace, $local_name ) = $crumb; // Match local name, respecting wildcard '*'.
if ( '*' !== $local_name && $local_name !== $element->local_name ) { return false; } // Match namespace, respecting wildcard '*'. if ( '*' !== $namespace && $namespace !== $element->namespace ) { return false; } } return true; } /** * Returns the nesting depth of the current location in the document. * * Example: * * $processor = new XMLProcessor( '<?xml version="1.0" ?><root><text></text></root>' ); * 0 === $processor->get_current_depth(); * * // Opening the root element increases the depth. * $processor->next_tag(); * 1 === $processor->get_current_depth(); * * // Opening the text element increases the depth. * $processor->next_tag(); * 2 === $processor->get_current_depth(); * * // The text element is closed during `next_token()` so the depth is decreased to reflect that. * $processor->next_token(); * 1 === $processor->get_current_depth(); * * @return int Nesting-depth of current location in the document. * @since WP_VERSION */ public function get_current_depth() { return count( $this->stack_of_open_elements ); } /** * Parses a qualified name into a namespace prefix and local name. * * Example: * * $this->parse_qualified_name( 'wp:post' ); // Returns array( 'wp', 'post' ) * $this->parse_qualified_name( 'image' ); // Returns array( '', 'image' ) * * @param string $qualified_name The qualified name to parse. * * @return string[] The namespace prefix and the local name, in that order. */ private function parse_qualified_name( $qualified_name ) { $namespace_prefix = ''; $local_name = $qualified_name; $prefix_length = strcspn( $qualified_name, ':' ); if ( strlen( $qualified_name ) !== $prefix_length ) { $namespace_prefix = substr( $qualified_name, 0, $prefix_length ); $local_name = substr( $qualified_name, $prefix_length + 1 ); } return array( $namespace_prefix, $local_name ); } /** * Asserts a qualified tag name is syntactically valid according to the * XML specification. * * @param string $qualified_name The qualified name to validate.
* @return bool Whether the qualified name is syntactically valid. */ private function validate_qualified_name( $qualified_name ) { if ( substr_count( $qualified_name, ':' ) > 1 ) { $this->bail( sprintf( 'Invalid identifier "%s" – more than one ":" in tag name. Every tag name must contain either zero or one colon.', $qualified_name ), self::ERROR_SYNTAX ); return false; } $prefix_length = strcspn( $qualified_name, ':' ); if ( 0 === $prefix_length && strlen( $qualified_name ) > 0 ) { $this->bail( sprintf( 'Invalid identifier "%s" – namespace qualifier must not have zero length.', $qualified_name ), self::ERROR_SYNTAX ); return false; } return true; } private function mark_incomplete_input( $error_message = 'Unexpected syntax encountered.' ) { if ( $this->expecting_more_input ) { $this->parser_state = self::STATE_INCOMPLETE_INPUT; return; } $this->parser_state = self::STATE_INVALID_DOCUMENT; $this->last_error = self::ERROR_SYNTAX; _doing_it_wrong( __METHOD__, $error_message, 'WP_VERSION' ); } /** * Returns context for why the parser aborted due to unsupported XML, if it did. * * This is meant for debugging purposes, not for production use. * * @return XMLUnsupportedException|null */ public function get_exception() { return $this->exception; } /** * Stops the parser and terminates its execution when encountering unsupported markup. * * @param string $message Explains support is missing in order to parse the current node. * * @throws XMLUnsupportedException Halts execution of the parser. */ private function bail( $message, $reason = self::ERROR_UNSUPPORTED ) { $starts_at = isset( $this->token_starts_at ) ? $this->token_starts_at : strlen( $this->xml ); $length = isset( $this->token_length ) ? 
$this->token_length : 0; $token = substr( $this->xml, $starts_at, $length ); $this->last_error = $reason; $this->exception = new XMLUnsupportedException( $message, $this->get_token_type(), $starts_at, $token, $this->get_breadcrumbs() ); throw $this->exception; } /** * Parser Ready State. * * Indicates that the parser is ready to run and waiting for a state transition. * It may not have started yet, or it may have just finished parsing a token and * is ready to find the next one. * * @since WP_VERSION * * @access private */ const STATE_READY = 'STATE_READY'; /** * Parser Complete State. * * Indicates that the parser has reached the end of the document and there is * nothing left to scan. It finished parsing the last token completely. * * @since WP_VERSION * * @access private */ const STATE_COMPLETE = 'STATE_COMPLETE'; /** * Parser Incomplete Input State. * * Indicates that the parser has reached the end of the document before finishing * a token. It started parsing a token but there is a possibility that the input * XML document was truncated in the middle of a token. * * The parser is reset at the start of the incomplete token and has paused. There * is nothing more that can be scanned unless the parser is provided a more * complete document. * * @since WP_VERSION * * @access private */ const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT'; /** * Parser Invalid Input State. * * Indicates that the XML document contains malformed input and cannot be parsed. * * @since WP_VERSION * * @access private */ const STATE_INVALID_DOCUMENT = 'STATE_INVALID_DOCUMENT'; /** * Parser Matched Tag State. * * Indicates that the parser has found an XML tag and it's possible to get * the tag name and read or modify its attributes (if it's not a closing tag). * * @since WP_VERSION * * @access private */ const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG'; /** * Parser Text Node State. * * Indicates that the parser has found a text node and it's possible * to read and modify that text.
* * @since WP_VERSION * * @access private */ const STATE_TEXT_NODE = 'STATE_TEXT_NODE'; /** * Parser CDATA Node State. * * Indicates that the parser has found a CDATA node and it's possible * to read and modify its modifiable text. In XML, CDATA sections may * appear anywhere character data may occur, and their contents are * returned without any further decoding. * * @since WP_VERSION * * @access private */ const STATE_CDATA_NODE = 'STATE_CDATA_NODE'; /** * Parser DOCTYPE Node State. * * Indicates that the parser has found a DOCTYPE declaration and it's possible * to read and modify its modifiable text. * * @since WP_VERSION * * @access private */ const STATE_DOCTYPE_NODE = 'STATE_DOCTYPE_NODE'; /** * Indicates that the parser has found an XML processing instruction. * * @since WP_VERSION * * @access private */ const STATE_PI_NODE = 'STATE_PI_NODE'; /** * Indicates that the parser has found an XML declaration. * * @since WP_VERSION * * @access private */ const STATE_XML_DECLARATION = 'STATE_XML_DECLARATION'; /** * Indicates that the parser has found an XML comment and it's * possible to read and modify its modifiable text. * * @since WP_VERSION * * @access private */ const STATE_COMMENT = 'STATE_COMMENT'; /** * Indicates that the parser encountered malformed syntax and has bailed. * * @since WP_VERSION * * @var string */ const ERROR_SYNTAX = 'syntax'; /** * Indicates that the provided XML document contains a declaration that is * unsupported by the parser. * * @since WP_VERSION * * @var string */ const ERROR_UNSUPPORTED = 'unsupported'; /** * Indicates that the parser encountered more XML tokens than it * was able to process and has bailed. * * @since WP_VERSION * * @var string */ const ERROR_EXCEEDED_MAX_BOOKMARKS = 'exceeded-max-bookmarks'; /** * Indicates that we're parsing the `prolog` part of the XML * document.
* * @since WP_VERSION * * @access private */ const IN_PROLOG_CONTEXT = 'prolog'; /** * Indicates that we're parsing the `element` part of the XML * document. * * @since WP_VERSION * * @access private */ const IN_ELEMENT_CONTEXT = 'element'; /** * Indicates that we're parsing the `misc` part of the XML * document. * * @since WP_VERSION * * @access private */ const IN_MISC_CONTEXT = 'misc'; /** * Indicates that the next XML token should be parsed and processed. * * @since WP_VERSION * * @var string */ const PROCESS_NEXT_NODE = 'process-next-node'; /** * Indicates that the current XML token should be processed without advancing the parser. * * @since WP_VERSION * * @var string */ const PROCESS_CURRENT_NODE = 'process-current-node'; /** * Unlock code that must be passed into the constructor to create this class. * * This class extends the WP_HTML_Tag_Processor, which has a public class * constructor. Therefore, it's not possible to have a private constructor here. * * This unlock code is used to ensure that anyone calling the constructor is * doing so with a full understanding that it's intended to be a private API. * * @access private */ const CONSTRUCTOR_UNLOCK_CODE = 'Use WP_HTML_Processor::create_fragment() instead of calling the class constructor directly.'; }
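
/*
 * Illustrative sketch only, not part of the XMLProcessor API.
 *
 * The suffix matching described for matches_breadcrumbs() can be hard to
 * picture from the docblock alone, so it is restated here as a standalone
 * function that can be run in isolation. The function name and the
 * plain-array element shape, array( namespace, local_name ), are assumptions
 * made for this example; the class itself compares against its internal
 * element objects and first checks that a tag is currently matched.
 */
function sketch_breadcrumbs_match( array $open_elements, array $breadcrumbs ) {
	$crumb_count = count( $breadcrumbs );
	$elem_count  = count( $open_elements );

	// More crumbs than open elements can never match.
	if ( $crumb_count > $elem_count ) {
		return false;
	}

	// Walk both lists from the innermost element outwards.
	for ( $j = 1; $j <= $crumb_count; $j++ ) {
		$crumb = $breadcrumbs[ $crumb_count - $j ];

		// A bare string crumb matches its local name in any namespace,
		// so a bare '*' becomes array( '*', '*' ) and matches any element.
		if ( ! is_array( $crumb ) ) {
			$crumb = array( '*', $crumb );
		}

		list( $namespace, $local_name )           = $crumb;
		list( $elem_namespace, $elem_local_name ) = $open_elements[ $elem_count - $j ];

		if ( '*' !== $local_name && $local_name !== $elem_local_name ) {
			return false;
		}
		if ( '*' !== $namespace && $namespace !== $elem_namespace ) {
			return false;
		}
	}

	return true;
}

/*
 * Usage, with hypothetical data for <root><post><content><image />:
 *
 *     $open = array(
 *         array( '', 'root' ),
 *         array( '', 'post' ),
 *         array( '', 'content' ),
 *         array( '', 'image' ),
 *     );
 *     sketch_breadcrumbs_match( $open, array( 'post', '*', 'image' ) ); // true
 *     sketch_breadcrumbs_match( $open, array( 'post', 'image' ) );      // false
 *
 * Matching only ever looks at the innermost elements, so the cost is linear
 * in the number of crumbs and no backtracking is needed.
 */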