InstructPatentGPT: Training patent language models to follow instructions with human feedback

In this research, patent prosecution is conceptualized as a system of reinforcement learning from human feedback. The objective of the system is to increase the likelihood that a language model generates patent claims with a higher chance of being granted. To showcase the controllability of the language model, the system learns from granted patents and pre-grant applications with different rewards. The statuses "granted" and "pre-grant" are treated as implicitly labeled human feedback. In addition, specific to patent drafting, the experiments in this research demonstrate the model's capability to learn from adjusting claim length and the inclusion of limiting terms for narrowing claim scope. As proof of concept, the experiments focus on claim ones only, and the training data originates from a patent dataset tailored specifically for artificial intelligence. Although the available human feedback in patent prosecution is limited and the quality of generated patent text requires improvement, the experiments following the 3-stage reinforcement learning from human feedback have demonstrated that generative language models are capable of reflecting the human feedback or intent in patent prosecution. To enhance the usability of language models, the implementation in this research utilizes modern techniques that enable execution on a single consumer-grade GPU. The demonstrated proof of concept, which reduces hardware requirements, will prove valuable in the future as more human feedback in patent prosecution becomes available for broader use, either within patent offices or in the public domain.


Introduction
The codename "InstructPatentGPT" in this research refers to the effort to align language models in the patent domain with human feedback from patent prosecution. This research draws inspiration from InstructGPT [1], which has successfully shown the effectiveness of Reinforcement Learning from Human Feedback (RLHF). In the patent domain, patent prosecution is the process of obtaining a patent for an invention, which includes tasks like drafting a patent application, issuing an office action, revising the patent application, and responding to the office action, along with other associated tasks. An office action is a written notice issued by a patent examiner to the patent applicant, addressing matters of patentability. Issuing and responding to office actions is an iterative back-and-forth process between patent examiners and patent applicants. The objective for patent applicants in general is to maximize the probability of patent allowance with the preferred patent scope. In computer science, reinforcement learning also involves an iterative process with an objective to maximize rewards. Hence, the idea is to merge these two areas and implement RLHF in the context of patent prosecution.
In essence, patent prosecution encompasses substantial human feedback, including opinions issued by patent examiners on patentability, revisions made by patent applicants to their patent applications, and the final determination for each patent claim: whether it is granted, rejected, or abandoned. In addition, the implicit intent in claim drafting, such as adjusting claim length for different patent scope or utilizing certain terms more or less frequently, also represents distinct human preferences that can be factored into reinforcement learning. Drawing from these observations, it can be stated that the training data for RLHF already exists, although a significant portion is not publicly available or not in text format. Given the effectiveness of RLHF, it is also worth investigating whether RLHF can utilize the human feedback in patent prosecution to generate patent text that maximizes the probability of patent allowance.
In this research, the language model for reinforcement learning is the PatentGPT-J-6B model [2]. According to [3], the model has been pre-trained from scratch exclusively using patent data. To the best of the author's knowledge, this research is the first time RLHF techniques have been used on language models in the patent domain. To serve as proof of concept and to reduce the entry barriers for experiments, the implementation in this research concentrates exclusively on the first claim of patent applications. Given the limitations imposed by publicly available data, the human feedback used in this study is somewhat limited; nevertheless, it remains sufficient to show the proof of concept. As for training data, the primary source of raw data for this research is the Artificial Intelligence Patent Dataset (AIPD) [4], which was made publicly available by the United States Patent and Trademark Office (USPTO). Selecting this dataset is intended to make the generated patent claims more understandable for readers knowledgeable in AI. Moreover, to promote the accessibility of language models and lower hardware demands, this study's implementation utilizes modern techniques to reduce the model's size, enabling it to run on a single consumer-grade GPU.

Related Work
Generative models have demonstrated notable efficacy across diverse domains in recent years. However, their potential applications within the patent domain remain comparatively underexplored. In [5], the authors focused on fine-tuning OpenAI GPT-2 [6] models for patent claim generation. In [7], they explored the control of patent text generation through the use of structural metadata in patents. Despite the proposal for personalized patent claim generation in another study [8], the methods for achieving controlled text generation remain unclear. Also, in [9], the authors utilized generative language models for large-scale text analysis and discovering public value expressions in AI patents. The effectiveness of employing a generative language model (GPT-4) [10] for generating labels and rationales was demonstrated by the authors in [9]. The approach offers advantages because labeling data is often difficult to accomplish accurately. Moving to a more specific technical field, in [11], the authors developed a framework to extract molecular structures from the USPTO patents and trained domain-specific generative RNN models to generate novel molecular structures. Notwithstanding these notable efforts, the exploration of generative models in the patent domain remains limited.
RLHF is an approach that combines supervised learning with reinforcement learning, in which a reinforcement learning agent learns how to maximize a reward from human feedback. Some prominent instances of RLHF-trained language models are OpenAI's ChatGPT [12] and its predecessor InstructGPT [1], as well as DeepMind's Sparrow [13] and Google's Bard [14]. To maximize the reward, RLHF also involves learning an optimal policy model that guides the agent's actions. A policy is a mapping from the current environment observation to a probability distribution over the actions (or tokens, in the case of language models) to be taken. A specific optimization algorithm employed to train the optimal policy model is Proximal Policy Optimization (PPO) [15], which was developed by OpenAI. This research utilizes the PPO algorithm to maximize the agent's rewards from human feedback. The agent in the context of this research is the generative language model. It is worth noting that training language models with RLHF has gained widespread acceptance in mainstream research, and there are comprehensive and readily accessible resources detailing this technique, e.g., [16] and [17]. Nevertheless, despite the popularity of RLHF in various domains, to the author's knowledge there is currently no known application of this method in the patent domain.
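For reference, the clipped surrogate objective introduced in the PPO paper [15] can be written as follows. Here the policy is denoted by \pi_\theta, the advantage estimate at step t by \hat{A}_t, and the clipping range by \epsilon; this is the general form from [15], not a statement of the specific hyperparameters used in this research.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \min\!\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
\right]
```

In the language-model setting, s_t corresponds to the prompt plus the tokens generated so far, and a_t to the next token.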
It is possible that the primary obstacle to implementing RLHF in the patent domain is the lack of labeled data. Generally, in many fields, acquiring human feedback involves a laborious process of manually labeling data. However, the patent domain may have a unique advantage in this regard, as public patent data and prosecution history inherently contain human feedback. For example, the patent examiner's response to grant or reject a patent claim serves as a form of human feedback. The revisions made to patent claims during patent prosecution may encompass other forms of human feedback or intent. The patent data used in this research originates from the USPTO. The USPTO offers several data sources, including the Patent Public Search website for end users, the PatentsView [18] website as a data visualization and analysis platform, the Bulk Data Storage System [19] providing a repository for raw public bulk data, and the Patent Examination Data System [20] allowing users to search, display and download multiple records of patent application, status, and transaction history.
In addition, the USPTO provides several research datasets [21]. The primary raw data in this research comes from the AIPD dataset [4] and the PatentsView platform [18]. The dataset, as per the Office of the Chief Economist (OCE) at the USPTO, aims to assist researchers and policymakers in focusing on the determinants and impacts of AI invention. Further details about the AIPD dataset and the PatentsView platform are available in working papers [22] and [23] respectively. It can be noted that the AIPD was used in the USPTO report "Inventing AI: Tracing the diffusion of artificial intelligence with U.S. patents" [24]. Despite the significance of artificial intelligence, to the author's knowledge, the AIPD dataset has not been utilized to train language models in the patent domain, and particularly not for reinforcement learning from human feedback.

Human feedback
The primary challenge in deploying RLHF revolves around obtaining human feedback effectively. The human feedback in RLHF can take various forms, including but not limited to: (a) preference ratings, (b) summarizations, (c) corrections, (d) demonstrations, and (e) specific reward signals. Typically, obtaining human feedback in most domains involves time-consuming manual data labeling. However, public patent data and prosecution history present a distinct advantage as they inherently include human feedback or intent. For instance, when it comes to (a) preference ratings, a granted patent claim is considered preferred, whereas a rejected claim is not. In terms of (b) summarizations, the abstract of a granted patent can be derived from the patent's description or claims. Additionally, (c) corrections can be identified through revised patent claims, which may address issues such as antecedent-basis errors encountered during patent prosecution. Regarding (d) demonstrations, existing dependent claims can serve as examples to illustrate how dependent claims are derived from independent claims.
When it comes to patent drafting and human intent, it is desirable to incorporate the drafting intent in controlling patent text generation. For instance, considering (e) specific reward signals, if the goal is to achieve a broader patent scope, generating shorter patent claims with fewer limitations can represent a higher reward signal. Conversely, when the drafting intent is to avoid anticipation by prior art and make the claim easier to grant (at the cost of potentially lower patent value), longer patent claims might be favored and represent a higher reward signal. In summary, the human feedback mentioned earlier in types (a) through (d) can be derived from patent data and prosecution history. Additionally, in the case of type (e), it is desirable for generative language models to be controllable and capable of reflecting the intent behind patent drafting. Owing to limitations in resources and the availability of public data, this research concentrates solely on implementing human feedback in types (a) and (e). More details are provided in Section 3.3.

Methodology
The methodology in this manuscript follows the typical 3-stage RLHF pipeline outlined in [25] and [1]. Specifically, the stages of training a patent language model from human feedback are:
1. Supervised Fine-Tuning (SFT). Fine-tune a pretrained language model with a domain-specific dataset. In this manuscript, the pretrained language model is PatentGPT-J-6B and the dataset is the AIPCO (AI Patent's Claim Ones) dataset in section 3.3.
2. Reward Model (RM). Collect a dataset with human feedback and train a reward model. In the patent domain, the human feedback or intent can be categorized in several types, as described in section 3.1.
3. Proximal Policy Optimization (PPO). Optimize a policy against the reward model by using PPO. PPO is a reinforcement learning algorithm that can be used to learn a policy that maximizes a scalar reward. The reward model's output in step 2 is considered as this scalar reward. Alternatively, in different experiments, the scalar reward can be defined through a reward function specified in source code.
Fine-tuning large language models is often prohibitively costly, and maintaining fine-tuned models of the same size as the original pretrained model can also be expensive. To address these issues, researchers have introduced parameter-efficient fine-tuning (PEFT) techniques [26]. These techniques aim to enable efficient adaptation of pre-trained language models to various downstream applications without the need to fine-tune all of the model's parameters. The concept is to add and fine-tune only a small number of extra parameters while freezing most parameters of the pretrained models. This approach leads to substantial reductions in computational and storage costs. For instance, a cutting-edge PEFT method known as Low-Rank Adaptation (LoRA) [27] has demonstrated performance similar to that of full fine-tuning. This research utilizes LoRA to efficiently adapt the pre-trained GPT-J-6B model. Moreover, to enable fine-tuning of language models on a single consumer-grade GPU (e.g., VRAM = 16G or 24G), this research leverages the techniques of 8-bit optimization via block-wise quantization [28].
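To make the parameter savings of LoRA concrete, the following minimal sketch compares the trainable-parameter count of a full weight matrix with that of a rank-r pair of adapter matrices. The function name, the 4096 x 4096 matrix size, and the rank of 8 are illustrative assumptions; the source does not state the rank or target modules used in this research.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA freezes it and trains two low-rank factors instead:
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative sizes only (hypothetical rank; not taken from the paper).
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumed sizes, the adapter trains well under one percent of the parameters of the matrix it adapts, which is the source of the reduction in computational and storage costs described above.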

Dataset
This research relies on two primary sources of raw data provided by the USPTO: AIPD [4] and PatentsView [18]. AIPD offers information and categorization of AI patents, while PatentsView provides details pertaining to patent documents, such as patent claims. According to [4], there exists a data file in the AIPD that identifies U.S. patents issued between 1976 and 2020 and pre-grant publications that contain one or more of eight AI technology components. These AI components are defined as: machine learning, evolutionary computation, natural language processing, speech, computer vision, knowledge processing, planning and control, and AI hardware. The authors in [22] generated this data file using a machine learning approach that analyzed patent text and citations to identify AI components in U.S. patent documents. This research follows the naming convention of the AI components in the data file as: ML, EVO, NLP, SPEECH, VISION, KR, PLANNING, and HARDWARE. To conduct the experiments in this research, eight training datasets are created, each corresponding to one of these eight AI components. In the AIPD data file, a document id can take one of two forms: (1) a patent number for granted patents, or (2) a publication number if the document is a published patent application (pre-grant). Additionally, the data file contains an application id, which represents the application number of a patent application. While the AIPD data file is helpful for identifying the AI categories of a patent document, it does not include the actual textual content of the document, such as the title, abstract, description, and claims.
Researchers have two sources of data for accessing the textual content of patent documents: the PatentsView platform and the Bulk Data Storage System (BDSS). One key distinction between PatentsView and BDSS is their data structure: BDSS is document-centric, while PatentsView is database-centric. In BDSS, a single file in XML format contains all the textual data and metadata for a given patent. On the other hand, the textual data for a patent is spread across multiple database table files at PatentsView. These individual table files can be imported and combined to create a comprehensive database. To achieve quicker iterations and facilitate model training, this research concentrates on the textual data of claim one. Consequently, the PatentsView platform is a better option for accessing the patent claims and integrating them with the AIPD data. PatentsView provides separate downloadable table files for granted patents and for pre-grant applications, organized on a yearly basis. The relation between granted patents and pre-grant applications can be identified through another table file called pg_granted_pgpubs_crosswalk, which maps patent application numbers to their corresponding granted patent numbers.
It is worth mentioning that a database dump from PatentsView is accessible upon request, which can simplify the process of creating a database. However, upon inspection, the patent claim text is not included in the database dump due to the substantial volume of patent claims. The inspected database dump is dated March 30, 2023. It would be beneficial if a newer version of the database dump released by the USPTO could include patent claims in the future. By integrating the AIPD data file, which covers eight AI technology components, with the database tables from PatentsView (granted patents, pre-grant applications, and crosswalk), the datasets needed in this research are created, encompassing all patents in AIPD along with their corresponding text of patent claim one. The training datasets are given the prefix AIPCO. Since AIPD comprises eight AI technology components, eight individual datasets are constructed for the experiments in this research, each representing a specific component along with its corresponding text of patent claim one. These datasets are named as follows: AIPCO-ML, AIPCO-EVO, AIPCO-NLP, AIPCO-SPEECH, AIPCO-VISION, AIPCO-KR, AIPCO-PLANNING, and AIPCO-HARDWARE. In the context of the methodology described in section 3.2, during the SFT stage, these eight datasets are used for fine-tuning. Subsequently, in the RM stage, the eight datasets are also used for training reward models. The eight SFT models and reward models are then used in the PPO stage of the experiment in section 4.4.
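The dataset construction described above can be sketched as a simple grouping step. The following is a minimal illustration, not the author's implementation: the record layout (a "doc_id" key plus one 0/1 flag per component) and the claim-text lookup are hypothetical stand-ins for the AIPD data file and the PatentsView tables, and the crosswalk lookup is omitted for brevity.

```python
from collections import defaultdict

# Component names follow the naming convention of the AIPD data file.
COMPONENTS = ["ML", "EVO", "NLP", "SPEECH", "VISION", "KR", "PLANNING", "HARDWARE"]


def build_claim_one_datasets(aipd_rows, claim_one_text):
    """Group claim-one texts into one dataset per AI component.

    aipd_rows: iterable of dicts with a hypothetical "doc_id" key and a
        0/1 flag for each component name.
    claim_one_text: dict mapping doc_id -> (status, text), where status is
        "granted" or "pre-grant" (hypothetical layout).
    """
    datasets = defaultdict(list)
    for row in aipd_rows:
        hit = claim_one_text.get(row["doc_id"])
        if hit is None:
            continue  # no claim text available for this document
        status, text = hit
        for comp in COMPONENTS:
            if row.get(comp) == 1:
                datasets["AIPCO-" + comp].append(
                    {"doc_id": row["doc_id"], "status": status, "claim_one": text}
                )
    return dict(datasets)
```

A document flagged with several components appears in several datasets under this sketch; whether the original datasets treat multi-component documents the same way is not stated in the source.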
Table 1 presents the statistics for the eight datasets. The first column displays the names of the datasets, followed by the total number of rows in the second column. The third column represents the total count of granted patents, and the fourth column is the average length of those patents. The fifth column represents the total count of pre-grant applications, and the sixth column is the average length of those applications. It is worth mentioning that the total count of pre-grant applications being less than the count of granted patents is attributed to the crosswalk data. For some granted patents in the data, such as reissued patents, the pre-grant application number is empty. For some other rows, the reasons for this emptiness are less evident. While further investigation might be needed to understand these reasons, for the purposes of this research, the numerical difference in rows between pre-grant and granted is not a major concern for training models.
It is noted that the Office Action Research Dataset for Patents [29] provided in [21] has the potential to enhance this research in the future. The dataset marks the first time that comprehensive data on examiner-issued rejections are available to the research community. As previously stated, an office action is a written notice to the patent applicant of the patent examiner's decision on patentability. Therefore, the notice generally discloses information such as the grounds for a rejection, the claims affected, and the pertinent prior art. According to [30], the relative inaccessibility of office actions has prevented researchers from fully exploiting valuable information during patent prosecution. The authors in [30] aim to rectify the situation by using natural language processing and machine learning techniques to systematically extract information from office actions and construct a relational database of key data elements. The dataset covers 4.4 million office actions mailed during the 2008 to mid-2017 period from USPTO examiners to the applicants of 2.2 million unique patent applications.
From the perspective in section 3.1, office actions encompass various types of human feedback. For instance, a rejection can be categorized as (a) preference rating as described in section 3.1. How the claims are affected and revised can be considered as (c) corrections. Furthermore, the pertinent prior art can play a role in (b) summarizations used for summarizing the basis in office actions. Ideally, the office actions serve as the most valuable source for obtaining the necessary human feedback for RLHF. However, upon close inspection, the dataset's coverage does not align with the primary data source AIPD in this research. In addition, the rejections in the dataset lack the patent text essential for training language models. As a result, for training reward models in the experiment of section 4.4, pre-grant applications are chosen as the source for negative samples, and granted patents serve as positive samples. Further elaboration will be provided in that section. It is also noted that the Patent Examination Data System [20], another data source from the USPTO, has comparable limitations and does not fulfill the requirements of this research. If either the Office Action Research Dataset or the Patent Examination Data System could offer more structured and comprehensive data in the future, a greater quantity of human feedback during patent prosecution might be available for leveraging in RLHF within the patent domain. In summary, a total of eight AIPCO datasets, one per AI component, are prepared, each containing the text of patent claim ones.

Library
The implementation in this research leverages several open source libraries in Python, particularly the TRL (Transformer Reinforcement Learning) library [31] and its examples. TRL is a full stack library providing a set of tools to train transformer language models with reinforcement learning. The library covers all three stages: SFT, RM, and PPO, as described in section 3.2. The TRL library is built on top of the transformers library by Hugging Face. Therefore, pre-trained language models, such as PatentGPT-J-6B [2] and DistilBERT [32], can be directly loaded. Throughout the research, TRL version 0.4.2.dev0 was utilized, while the library remained under intensive development. Other options for implementing reinforcement learning in the language domain include: TRLX (Transformer Reinforcement Learning X) [33], TextRL (Text Generation with Reinforcement Learning) [34], and RL4LMs [35]. Amidst the fast-paced development in applying RLHF techniques to language models, it is advised to observe which library stands out as the most promising in the future.
To enable the fine-tuning of language models on a single GPU with limited 16G VRAM, this research utilizes the LoRA method as low-rank adaptation from the PEFT library [26]. The LoRA method proves to be effective in reducing the number of trainable parameters by using parameter-efficient fine-tuning techniques. As shown in [26], using LoRA on consumer hardware yields performance comparable to that of full fine-tuning which demands high-end hardware. Furthermore, this research reduces the model size by loading a model in 8-bit precision through the bitsandbytes library [36]. The library is a lightweight wrapper around CUDA (Compute Unified Device Architecture) custom functions, 8-bit optimizers, matrix multiplication, and quantization functions. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements. It is a proprietary software layer developed by NVIDIA.
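The idea behind block-wise 8-bit quantization can be illustrated with a simplified sketch. The code below uses plain linear absmax scaling per block, which is an assumption made for readability; it is not the bitsandbytes implementation, whose actual quantization schemes are more elaborate. The function names and the block size of 4 are hypothetical.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to signed 8-bit codes with one absmax scale per block.

    Scaling each block separately keeps a single outlier from degrading
    the precision of values in other blocks.
    """
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(v / scale * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map 8-bit codes back to approximate floats using the block scales."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


values = [0.1, -0.5, 0.25, 1.0, 100.0, -50.0, 25.0, 0.0]
codes, scales = quantize_blockwise(values)
restored = dequantize_blockwise(codes, scales)
```

In this example the large values in the second block do not affect the precision of the small values in the first block, because each block carries its own scale.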

Training
Based on the methodology described in section 3.2, the first stage (SFT) involves fine-tuning the pretrained language model PatentGPT-J-6B using the eight AIPCO datasets in section 3.3. During fine-tuning, the PatentGPT-J-6B model is loaded in 8-bit precision and with the low-rank adapter in the LoRA method. The adapter adds pairs of rank-decomposition weight matrices to the 8-bit model, and only these newly added weights are fine-tuned. After fine-tuning, an additional step is taken to merge the adapter weights into the original model. This merging of weights results in the creation of the SFT model. Each of the eight AIPCO datasets is used to individually fine-tune the PatentGPT-J-6B model, resulting in the creation of eight domain-specific SFT models for the subsequent stage. Each model's fine-tuning is completed in a single epoch. The perplexity values at the end of the fine-tuning for each dataset are shown in Table 2. Perplexity is a statistical measure of how confidently a language model predicts a text sample. The lower the perplexity value, the better the model can predict the next word or sequence of words in a given text. The perplexity values in Table 2 are considered low, which suggests that the SFT models have effective predictive capabilities.
The second stage (RM) involves the training of a reward model using human feedback. This research explores two separate implementation approaches for this stage. The first approach pertains to (a) preference ratings in section 3.1. In this approach, the categorization of granted and pre-grant represents a form of human feedback. This implicit human feedback is already labeled and does not require further labeling efforts. The base model utilized for training in this stage is the distilbert-base model [32]. The downstream task for the base model is a binary classification task (granted or pre-grant). A reward of 1 is assigned to instances classified as granted, while a reward of 0 is given to instances classified as pre-grant. More information about how this reward model is used in the subsequent reinforcement learning (PPO) can be found in section 4.4.
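The perplexity measure reported for the SFT models can be computed from per-token log-probabilities as the exponential of the mean negative log-likelihood. The sketch below shows this standard definition; the function name and input format are illustrative assumptions, and the source does not describe how its perplexity values were computed in code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs.

    token_log_probs: log-probabilities the model assigned to each actual
    next token in the evaluated text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token is, on average,
# as uncertain as a fair choice between two options.
print(perplexity([math.log(0.5)] * 4))
```

A perplexity of 1 would mean the model predicts every token with certainty, which is why lower values indicate better predictive capability.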
Returning to the base model, the distilbert-base model is a distilled version of the BERT base model. It reduces the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster, according to [37]. With a size of 66M parameters, it avoids out-of-memory (OOM) issues, which makes it practical in this research to execute the reward model alongside the SFT model during the subsequent reinforcement learning stage on a consumer-grade GPU. The same eight AIPCO datasets used in the SFT stage are also employed in the RM stage. As a result, eight reward models are created after fine-tuning the distilbert-base model with the AIPCO datasets individually. The accuracy results of these eight reward models are presented in Table 3. Each reward model underwent training for one epoch. Each AIPCO dataset was split into 90% for training, 5% for validation, and 5% for testing. For demonstrative purposes, the performance of these reward models is considered adequate for building a prototype of RLHF. Assuming GPU VRAM is not limiting, it is speculated that using the PatentGPT-J-6B model as the base model may yield improved performance. Nonetheless, confirming this assumption will require additional resources and further investigation in the future.
In the RM stage, the second approach of reward implementation focuses on (e) specific reward signals as discussed in section 3.1. This approach expands upon the [...] The reward functions are designed to reflect the underlying intent behind patent claim drafting and determine the reward accordingly. In this research, three reward functions are implemented and tested. Generally, for patent practitioners, shorter patent claims are preferred as they offer a broader scope for the patent. In contrast, longer patent claims may increase the likelihood of obtaining a patent allowance. During patent prosecution, there are two main ways a patent examiner might reject a patent application over prior art. One is anticipation. The other is obviousness. It is generally easier for longer claims to be granted because they are more likely to avoid anticipation by prior art and reduce the likelihood of an obviousness rejection. Nevertheless, as a patent claim becomes longer, its scope tends to become narrower. From this practical aspect, attaining the ability to control the length or scope of generated patent claims is highly desirable. To the author's knowledge, prior to this research, no attempts have been made to control the length of patent text generation.
In section 4, the first reward function depends on the length of patent claims, and it computes the reward value based on a designated maximum length. If the generated patent claim exceeds the maximum length, the reward will be set to zero. Within the specified maximum length, longer patent claims will receive higher rewards. This reward function guides the subsequent PPO algorithm to learn to generate longer patent claims, while attempting to abide by the maximum length constraint. The main objective of the experiment in section 4.1 is to assess the viability of controlling patent text generation using RLHF. Additional details concerning this specific reward function can be found in that section. The second reward function focuses on controllability based on patent scope. When drafting patents, the inclusion of limiting terms (e.g., "wherein") serves to narrow the scope of a patent claim. A patent with more limiting terms and limiting clauses generally has a narrower scope. Consequently, this may increase the chances of the patent being allowed, while reducing the likelihood of patent infringement. In the experiment detailed in section 4.2, the reward function is designed to count the occurrences of such limiting terms. A higher reward value is assigned when there are more limiting terms. Further information about this reward function will be provided in the same section. The third reward function combines the previous two to calculate a joint reward value based on both the length of patent claims and the count of limiting terms. The specifics of this joint reward function, along with the experimental details and results, can be found in section 4.3.
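A reward function that counts limiting terms, as described for the second reward function, might look like the following minimal sketch. Only "wherein" is named in the text; the function names, the case-insensitive whole-word matching, and the use of the raw count as the reward value are assumptions for illustration, since the exact definition is deferred to section 4.2.

```python
import re

# The text names "wherein" as an example; any further terms would be assumptions.
LIMITING_TERMS = ("wherein",)


def count_limiting_terms(claim, terms=LIMITING_TERMS):
    """Count whole-word, case-insensitive occurrences of limiting terms."""
    text = claim.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", text)) for term in terms
    )


def limiting_term_reward(claim):
    """Hypothetical reward: more limiting terms yield a higher reward."""
    return float(count_limiting_terms(claim))


claim = "A method comprising a step, wherein X is Y, and wherein Z is W."
print(limiting_term_reward(claim))
```

Because this reward is rule-based, it needs no training data, in contrast to the classifier-based reward model of the first approach.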
Returning to the 3-stage methodology, the third stage involves training and using PPO as the reinforcement learning algorithm to optimize the SFT model from the first stage. This optimization is done against either a reward model obtained in the second stage or a defined reward function. It is noted that a reward model requires training and data, whereas a reward function does not. Reward models are neural-based and obtained through training, while reward functions are rule-based and defined directly in source code. Despite the differences, both types of rewards yield numeric values, which allows them to be mathematically combined in the training. This amalgamation of neural-based rewards and rule-based rewards is expected to have broad applicability, encompassing a wider range of use cases in the future.
For example, to be patentable, a patent claim must fulfill at least three essential requirements: novelty, non-obviousness, and utility. Theoretically, training three reward models, each corresponding to a specific requirement, along with defining a reward function to assess the patent claim's length, would make it possible to train a policy model that can generate patentable patent claims within a predefined length. In this research, a reward model is trained in section 4.4. The joint reward for section 4.3 is composed of the reward functions in sections 4.1 and 4.2. Technically, it is also feasible to have a joint reward from the reward model in section 4.4 and the reward function in sections 4.1, 4.2, or 4.3. Nonetheless, before proceeding with any further joint reward, it is crucial to conduct separate validation on the quality of patent text generation using these reward models or functions. Future research needs to explore the combination of the aforementioned reward model and reward functions.
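Since both kinds of reward are plain numbers, one simple way to combine them is a weighted sum. The sketch below is an assumption for illustration only: the function name, the weights, and the choice of a weighted sum are hypothetical, and the source does not specify how joint rewards are calculated.

```python
def combined_reward(text, reward_model, reward_fns, model_weight=1.0, fn_weights=None):
    """Weighted sum of one neural reward and any number of rule-based rewards.

    reward_model: callable returning a score for the text (for example, a
        trained classifier's output); treated as a black box here.
    reward_fns: list of rule-based callables defined directly in code.
    """
    if fn_weights is None:
        fn_weights = [1.0] * len(reward_fns)
    total = model_weight * reward_model(text)
    for weight, fn in zip(fn_weights, reward_fns):
        total += weight * fn(text)
    return total


# Stand-ins for a trained reward model and a rule-based function.
stub_model = lambda text: 0.8
length_fraction = lambda text: len(text) / 10
print(combined_reward("abcde", stub_model, [length_fraction]))
```

In practice the individual rewards would likely need to be brought onto comparable scales before summing, so that no single term dominates the policy update.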

Release
Upon the publication of this manuscript, the source code, datasets, SFT models, reward models, policy models, and experimental results will be made accessible to the public.

Experimental results
This chapter presents a series of RLHF experiments, each examining a different reward model or reward function. The first experiment, in section 4.1, implements a reward function based on claim length. The second, in section 4.2, implements a reward function based on the number of limiting terms in patent claims. The third, in section 4.3, implements a joint reward by combining the first and the second reward functions. The fourth, in section 4.4, implements a reward model based on the classification of granted patents and pre-grant applications. Each of these experiments also includes subsequent reinforcement learning during the PPO stage, aimed at training a policy model in alignment with the respective reward function or reward model. Regarding computational demands, with 8-bit quantization, the experiments in sections 4.1, 4.2, and 4.3 require a GPU with 16GB of VRAM. In contrast, the experiment in section 4.4 requires a GPU with 24GB of VRAM because of the reward model. To facilitate the peer review process of this research, the patent claims generated in all experiments can be accessed at [38]. These patent claims will be made public after the publication of this research. A selection of exemplary patent claims, including both higher and lower rewards, can be found in the Appendix.

Experiment 1: based on claim length
According to [39], "patent prosecutors and examiners have long assumed a link between claim length and patent validity. The conventional wisdom is embodied in the so-called pencil test, which predict that patent claims that can be covered by a pencil, are unlikely to be both valid and infringed." From this perspective, a longer patent claim is more likely to be valid but less likely to be infringed, whereas a shorter patent claim is less likely to be valid but more likely to be infringed. In [40], the first large-scale analysis of patent claim length and patent scope, the authors validated that independent claim length is negatively correlated with patent scope. According to the authors, the validation also shows that independent claim length independently explains other measures of patent scope that have been used in the literature: patent maintenance, forward citations, and the breadth of patent classes. Hence, the conventional wisdom mentioned in [39] is empirically supported. In [40], it is noted that the average length of patent claims, measured in words, is 94. In the code snippet for this reward function, the variable max_len sets the upper threshold for the permissible length of the generated text, measured in characters. If the generated patent text exceeds this threshold, the text is truncated. After truncation, if the text lacks the tag <|end of claim|>, a reward of zero is assigned, because the missing tag indicates that the originally generated text exceeded the upper threshold. In contrast, when the tag is present, the reward is computed as 1 + len(s) / float(max_len), so that a greater reward corresponds to longer text as a proportion of the maximum length.
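The rule described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the listing used in the experiments: the function name `length_reward` is hypothetical, and the end tag is spelled as it appears in this text.

```python
END_TAG = "<|end of claim|>"  # end-of-claim tag, spelled as it appears in this text


def length_reward(s: str, max_len: int = 512) -> float:
    """Rule-based reward that favors longer claims up to max_len characters.

    Hypothetical helper: the name and signature are assumptions of this sketch.
    """
    # Shorten any text that exceeds the permitted length.
    s = s[:max_len]
    # If the tag did not survive truncation, the original text was too long
    # (or never terminated), so no reward is given.
    if END_TAG not in s:
        return 0.0
    # Otherwise the reward grows with length as a proportion of max_len.
    return 1 + len(s) / float(max_len)
```

Under this sketch, the short completion " system.<|end of claim|>" shown in Appendix A.2 has 24 characters and would score 1 + 24/512 = 1.046875, which is in line with the lower reward (1.04) indicated there.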
Fig. 1 shows the quantitative outcomes obtained with a maximum length of 512. The SFT model for reinforcement learning is the ML model in Table 2. Graph (a) in Fig. 1 depicts the progression of reward mean values throughout the PPO training, covering a span of 10,000 training steps. The curve ascends and exceeds a value of 1, signifying that the policy has learned to produce patent claims with an average length that approaches but remains below 512. Graph (b) illustrates the average length of generated patent claims. As depicted, there is a gradual decline in the average length. Taken together with graph (a), graph (b) suggests that the policy is progressively becoming more adept at producing patent claims with lengths that fall below 512 characters. Graph (c) presents the number of limiting terms in the generated patent claims. This graph will be cross-referenced in the experiment in section 4.2, which focuses on the reward function using limiting terms.
Both graphs (b) and (c) also include plots of the trend using moving averages with a window size of 100. In Fig. 2, the significance and findings drawn from all three graphs parallel those of their counterparts in Fig. 1. Hence, repetitive explanations are omitted here for brevity. For future qualitative analysis, interested readers can refer to the exemplary patent claims with higher and lower rewards in Appendix A, or to all patent claims generated in this research at [38].
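For readers who wish to reproduce the trend lines in graphs (b) and (c), a moving average with a window of 100 can be computed as sketched below. The function name is hypothetical, and a trailing window that averages over fewer samples at the start of training is an assumption; the plots in this research may handle the first points differently.

```python
from collections import deque


def moving_average(values, window=100):
    """Trailing moving average over the last `window` samples.

    Early points average over the samples seen so far (an assumption).
    """
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            # The oldest sample is about to be evicted by append().
            total -= buf[0]
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out
```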

Experiment 2: based on limiting terms
A limiting clause in a patent claim expresses one or more inventive aspects of the invention on which the patent was conditioned and allowed. Typically, the term "wherein" denotes such a limiting clause and narrows the scope of the patent claim during patent prosecution. For example, a claim might read, "A device comprising A, B, and C, wherein C is made of material X." In this instance, "wherein" limits the scope of C to being made of a specific material, X. A narrower claim scope increases the likelihood of allowance, although it decreases the likelihood of the claim being infringed. If the goal is to increase the chance of patent allowance, including more "wherein" clauses may be preferable. Conversely, if the intention is to have a broader claim scope, it is generally advisable to use fewer "wherein" clauses. A well-crafted patent claim should strike a balance between a broader claim scope and the likelihood of allowance, or between a narrower claim scope and the likelihood of infringement. In patent litigation, a defendant might occasionally argue that a "wherein" clause is not limiting because it merely states the intended results and is not material to patentability. For example, according to [41], in Case No. 2018-2207 (Fed. Cir. Aug. 29, 2019), the defendant argued that, without the limiting effect of the "wherein" clause, the resulting broader claims were invalid as obvious based on prior rulings. In fact, the "wherein" clauses of the patents in suit referenced efficacy and safety for a method of treatment. Therefore, the district court found the disputed "wherein" clauses to constitute claim limitations because "they were material to patentability and expressed the inventive aspect of the claimed invention." The US Court of Appeals for the Federal Circuit upheld the district court's finding that the disputed "wherein" clauses were indeed limiting.
Ideally, an effective patent claim should contain neither too many nor too few limiting terms. Nevertheless, achieving this equilibrium through RLHF is a complex endeavor at the current stage of research. Instead, the present experiment aims to explore whether the policy model in RLHF can be optimized against a reward function based on the count of limiting terms. By validating the controllability of the policy model over the use of limiting terms, it might become feasible to train the policy model to maintain a suitable equilibrium in subsequent research. For this experiment, the limiting terms are: wherein, where, when, and whereby. This list is not exhaustive, and additional terms could be incorporated following further study. The reward function in this experiment is outlined in Listing 2 below. The SFT model applied for reinforcement learning is the same ML model used in section 4.1.
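A reward of this kind can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the listing used in the experiments: the function name is hypothetical, and whole-word, case-insensitive matching is assumed so that "wherein" and "whereby" are not additionally counted as "where".

```python
import re

# The four limiting terms named in the text.
LIMITING_TERMS = ("wherein", "where", "when", "whereby")

# Whole-word, case-insensitive matching is an assumption of this sketch.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(LIMITING_TERMS) + r")\b", re.IGNORECASE
)


def limiting_term_reward(text: str) -> float:
    """Return the number of limiting terms in the text as the reward."""
    return float(len(_PATTERN.findall(text)))
```

Applied to the example claim "A device comprising A, B, and C, wherein C is made of material X.", this sketch returns 1.0.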
Listing 2: Reward based on limiting terms
In this experiment, training for 2,500 steps is sufficient to validate the policy's controllability over the use of limiting terms. In Fig. 3, graph (a) shows a progressive rise in the mean reward value as training advances. This reward value corresponds to the count of limiting terms. Meanwhile, graph (b) shows a gradual increase in the length of generated patent claims over the training steps. It is logical that the policy model generates longer patent claims as the number of limiting terms increases. Both graphs (b) and (c) include plots of the trend using moving averages with a window size of 100. It is worth observing that in section 4.1, the curves within graph (c) appear relatively flat. This is because the reward function there does not count limiting terms. Regarding the quality of text generation in this experiment, it is noticeable that higher rewards could potentially result in a decline in text quality. A thorough qualitative analysis would require further effort, which exceeds the resources available in this research. Nevertheless, interested readers can refer to the exemplary patent claims with both higher and lower rewards in Appendices B.1 and B.2. Alternatively, all the generated patent claims are accessible online at [38].
To ensure the reproducibility of the policy's controllability over the use of limiting terms, the next experiment employs the third model (NLP) in Table 2 as the SFT model for reinforcement learning. Another aim of the experiment is to investigate the upper bounds of claim length and the number of limiting terms depicted in graphs (b) and (c). Interestingly, as depicted in Fig. 4, graph (a) illustrates that the mean reward value surpasses 50 and then diminishes. Correspondingly, graph (b) shows an initial ascent followed by a descent in claim length, while graph (c) shows an initial ascent followed by a descent in the number of limiting terms. This outcome is unexpected since rewards were anticipated to increase consistently.
Upon inspecting the patent claims generated at different training steps, it was observed that the maximum number of limiting terms reached 173 at training step 3699 (see Appendix B.4). At training step 454, the number of limiting terms was 3, and the generated patent claim is relatively favorable (see Appendix B.3). In contrast, at training step 4500, the reward was down to zero, as shown in Appendix B.5, demonstrating an evidently unfavorable outcome. The phenomenon of the model's collapse after reaching its peak reward is perplexing and calls for further inquiry. Another notable point gleaned from inspection is the decline in the quality of generated patent claims as their length extends. The correlation observed indicates

Experiment 3: a joint reward function
The joint reward function in this experiment combines the reward functions in sections 4.1 (using max_len=1024) and 4.2. The source code is outlined in Listing 3 below. The training steps have been increased to 10,000. The goal is to validate the controllability of the policy over both the claim length and the use of limiting terms. The SFT models applied for reinforcement learning are the eight models shown in Table 2. Appendix C shows the file names in [38] containing generated patent claims for each model. The claim lengths and the number of limiting terms are depicted in Fig. 5 for each model. To ensure conciseness, the curves depicting reward mean values have been omitted.
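One way to combine the two rule-based rewards is sketched below. How the two terms are combined is not stated in this text, so the plain sum used here is an assumption, as are all function names, the whole-word matching of limiting terms, and the spelling of the end tag.

```python
import re

END_TAG = "<|end of claim|>"  # spelled as it appears in this text
_TERMS = re.compile(r"\b(?:wherein|where|when|whereby)\b", re.IGNORECASE)


def length_reward(s: str, max_len: int = 1024) -> float:
    """Length-based reward from section 4.1 (zero if the tag is cut off)."""
    s = s[:max_len]
    if END_TAG not in s:
        return 0.0
    return 1 + len(s) / float(max_len)


def limiting_term_reward(s: str) -> float:
    """Count-based reward from section 4.2 (whole-word match assumed)."""
    return float(len(_TERMS.findall(s)))


def joint_reward(s: str, max_len: int = 1024) -> float:
    """Joint reward; the plain sum is an assumption of this sketch."""
    return length_reward(s, max_len) + limiting_term_reward(s)
```

With a sum, an over-length claim forfeits only the length component and still collects the count of limiting terms.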
Listing 3: a joint reward function
Fig. 5 illustrates that the claim length is constrained by the upper bound, as in Fig. 2. As a result, the claim length does not increase continuously, in contrast to the trend observed in Fig. 3. At the same time, Fig. 5 shows a gradual rise in the number of limiting terms, even while the claim length constraint is respected. This phenomenon can be rationalized by considering the model's pursuit of higher rewards. As the inclusion of more limiting terms results in higher rewards, and given that longer claims can still achieve a minimum reward of zero, the model tends to overlook claim length and prioritizes the addition of more limiting terms. Ultimately, similar to the findings in Fig. 4, the model inexplicably breaks down after extended training. Beyond this qualitative analysis, conducting a quantitative analysis remains a challenge. Such an endeavor would demand substantial effort from patent practitioners, which exceeds the scope of the current research.

Experiment 4: based on granted or pre-grant
This experiment involves the implementation of both the RM stage and the PPO stage described in section 3.2. At the RM stage, a distilbert-base model is trained as the reward model on a binary classification task (granted = 1, pre-grant = 0). It is noted that pre-grant applications usually have shorter claims than granted patents. This is because inventors and patent practitioners often initially aim for broader patent scopes and subsequently extend the length of the claims to narrow their scopes. The purpose is to overcome any prior art identified later by patent examiners, but only as necessary. This heuristic is confirmed by the research in [40]. In Fig. 3(a) of [40], the authors compare claim length trends between patent applications and issued patents for the years 2001 to 2014. Three different types of documents are compared: (1) published applications that are later abandoned, (2) published applications that are later granted, and (3) granted patents. The average length observed in (1) is consistently shorter than that in (2), and similarly, the average length in (2) is always shorter than that in (3). The authors conclude that the claims of granted patents are narrower in scope than those of applications that are published and later granted, and these in turn are narrower than those of applications that are published and later abandoned.
In this manuscript, the training data for the reward models come from the eight datasets listed in Table 1. As a result, eight distinct reward models are trained. Their performance is reported in Table 3. Subsequently, the PPO stage builds upon the previously fine-tuned SFT models and uses these reward models to train eight policy models with reinforcement learning. Because of the context window size of the distilbert-base model, the reward model's input is capped at 500 tokens. The reward value used in PPO is the accuracy of the classification task based on the reward model. Training a single policy model for 10,000 training steps takes approximately 5 days on an NVIDIA L4 GPU with 24GB of VRAM. This training process could also be accomplished on a consumer-grade GPU such as the RTX 4090, which also has 24GB of VRAM. The mean reward values for each model are illustrated in Fig. 6. To maintain brevity, the two graphs showing claim length and the number of limiting terms are omitted. Nevertheless, it is noted that the graph of claim length shows a gradual and slight increase. This observation is reasonable because, in the experience of patent practitioners, lengthier patent claims typically have a higher probability of being granted. This observation is also supported by Table 1, in which the average length of granted patents exceeds that of pre-grant applications. Another observation is that the curve depicting the number of limiting terms is relatively flat. This finding is reasonable because no reward function in this experiment encourages the inclusion of limiting terms.
To assess the efficacy of RLHF, patent allowances before and after PPO training are compared, based on the prediction of the reward model. Prior to PPO training, the SFT models were fine-tuned using the AIPCO datasets. The likelihood of generating a granted patent without a prompt should resemble the proportion of granted patents in each dataset. After PPO training, the policy model's propensity to produce granted patents is expected to rise. The results in Table 4 are in accordance with this expectation.
In Table 4, the first column presents the names of the datasets. Within this experiment, an assessment is conducted on the initial 1,000 rows within each dataset. The subsequent two columns display the counts of both granted and pre-grant records, as determined by the respective reward model associated with each dataset. In the fourth column, the ratio of granted records relative to the total of 1,000 rows is displayed.
The second, third, and fourth columns present the results before the PPO training. These results are based on the SFT models and reward models, without PPO. Similarly, the fifth, sixth, and seventh columns present the corresponding results after the PPO training. These results are thus based on the policy models in RLHF and the reward models. Notably, the ratio after PPO training is higher than the ratio observed before PPO training. This outcome provides strong validation for the efficacy of RLHF in this experiment.
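The three values reported per dataset on each side of Table 4 (granted count, pre-grant count, and granted ratio) follow directly from the reward model's predicted labels. A minimal sketch is given below; the function name is hypothetical, and the labels in the usage lines are illustrative values, not results from Table 4.

```python
def granted_ratio(labels):
    """Summarize predicted labels (1 = granted, 0 = pre-grant).

    Returns (granted count, pre-grant count, granted ratio).
    """
    labels = list(labels)
    if not labels:
        raise ValueError("at least one predicted label is required")
    granted = sum(labels)
    pre_grant = len(labels) - granted
    return granted, pre_grant, granted / len(labels)


# Illustrative labels only; these are not results from Table 4.
before_ppo = granted_ratio([1, 0, 0, 1])
after_ppo = granted_ratio([1, 1, 0, 1])
```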
A technical detail within this experiment is that, for each of the 1,000 rows, the first 30 tokens of the patent claim are extracted and used as the prompt for the policy model to generate patent text. With this setting, it is assumed that the 1,000 prompts collected for the policy model exhibit sufficient diversity. Additionally, it is assumed that the initial 30 tokens used as prompts are not decisive, that is, they do not by themselves determine whether the reward model predicts the outcome as granted or pre-grant. In terms of the relations among tables in this research, each policy model in Table 4 learns from its respective reward model in Table 3. As detailed in section 3.5, each AIPCO dataset in Table 1 serves as the basis for its corresponding SFT model in Table 2 and reward model in Table 3.
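The prompt construction described above can be sketched as follows. Whitespace splitting stands in for the model's own tokenizer here, which is an assumption of this sketch (a subword tokenizer would generally cut the claim at a different point); the function name is hypothetical, and the start tag is spelled as it appears in the appendix of this text.

```python
START_TAG = "<|start of claim|>"  # spelled as it appears in this text


def make_prompt(claim: str, n_tokens: int = 30) -> str:
    """Build a prompt from the first n_tokens tokens of a claim.

    Whitespace tokens are a stand-in for the model's tokenizer (assumption).
    """
    tokens = claim.split()
    return START_TAG + " ".join(tokens[:n_tokens])
```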
For future qualitative analysis, interested readers can refer to the exemplary patent claims with higher and lower rewards in Appendix D, or to all generated patent claims at [38]. It should be noted that, while the increased ratios in Table 4 are derived from the reward model trained on granted patents, it cannot be inferred that the trained policy model will predominantly produce patent claims with a higher probability of being granted in actual patent prosecution, let alone claims that hold up in litigation. The process of patent prosecution is intricate, and patent litigation is even more complex. Hence, the patent claims produced by the policy model and predicted by the reward model as granted in this study should not be mistaken for the actual outcomes that would occur in real patent prosecution.
In fact, after initial (subjective) inspections, whether PPO training was applied does not appear to correlate with the quality of the generated patent claims. For example, in Appendix D.1, the text generated without PPO training is less favorable than the text generated after PPO training in Appendix D.2. On the other hand, in Appendix D.3, the text generated without PPO training is more favorable than the text generated after PPO training in Appendix D.4. The crucial factor here is the reward model itself. As a demonstration of the concept, this experiment shows that the policy model in RLHF can effectively align with the reward model for higher rewards. However, the ultimate quality of the generated patent text hinges on the performance of the reward model. To create high-quality patent claims, it is imperative that these claims fulfill the novelty, non-obviousness, and utility requirements stipulated by patent law. Relying solely on granted patents and pre-grant applications falls short of constructing a comprehensive reward model. The complexities associated with training a reward model to address all requirements in patent law will be discussed in section 5.1.

Patent prosecution as an RLHF system
The overarching aim within this realm of research is to envision the entire patent prosecution system as an RLHF system. The ultimate goal is to generate patent claims that meet the legal requirements for allowance by patent offices. From the perspective of RLHF, the human participants in patent prosecution include several roles such as inventors, patent agents, patent attorneys, and patent examiners. The feedback from humans takes various forms, such as the office actions issued by patent examiners and the revisions made by patent agents, attorneys, or inventors. Reinforcement learning corresponds to the iterative back-and-forth dynamics within patent prosecution, and the objective for patent practitioners is to maximize the probability of patent allowance with the preferred patent scope.
As mentioned, to be patentable, at least three major requirements in patent law must be satisfied: novelty, non-obviousness, and utility. The challenges lie in developing corresponding reward models for each of these requirements and acquiring the necessary data to effectively train these models. The reward model in section 4.4 for classifying granted and pre-grant claims is demonstrative and aims to offer a preliminary assessment of whether the three requirements have been met jointly. In terms of acquiring the necessary data, it will be crucial to consider the prior art references cited in the office actions issued by patent examiners. By pinpointing and extracting relevant paragraphs from prior art references to use as training data, it may be possible to create reward models more capable of evaluating novelty and non-obviousness. However, although these prior art references are present in the records of patent prosecution history, they are not publicly accessible in text form. This obstacle could potentially be overcome in the future if patent offices release more textual and structural data from patent prosecution for public access.
It is also worth mentioning that, among the three requirements, the utility requirement is likely the most challenging to address. It bears resemblance to the issue of factuality encountered with large language models in general. Such models have gained notoriety for generating content that lacks grounding or is entirely hallucinated (although hallucination might spark innovative creativity in inventors). In the patent domain, all granted patents and most pre-grant applications are expected to have met the utility requirement. Consequently, instances where the requirement is unmet essentially do not exist. This scarcity of negative samples creates an insurmountable hurdle, making the training of a classifier for the utility requirement implausible. In conjunction with the challenges posed by grounding and hallucination, training a reward model to evaluate the utility requirement in patent law becomes exceptionally difficult. In the foreseeable future, having a human in the loop to assess the utility requirement might be the sole solution.
In brief, the suggestion in this section is to conceptualize the entire patent prosecution system as an RLHF system in which a human in the loop remains integral for evaluating the utility requirement. As for the novelty and non-obviousness requirements, it is advisable to train either two distinct reward models or a single joint reward model. Similarly, additional requirements of patent law, such as the written description requirement, the antecedent-basis requirement, means-plus-function structure, and patentable subject matter, could potentially be evaluated through reward models, benefiting from the available positive (allowed) and negative (rejected) samples in patent prosecution history.

Future Work
Owing to limited resources, this manuscript is constrained by the absence of comprehensive qualitative and quantitative analyses. Addressing these limitations would require significant effort from patent practitioners, particularly in reviewing generated patent claims and providing human feedback. Beyond these efforts, and from a broader perspective, there are two potential directions for future research at the intersection of AI and patent law. The first is from the perspective of generative AI, while the second is from the perspective of the patent domain. In terms of the former, the rapid progress of generative AI techniques over recent years promises expanded application within the patent domain, particularly in light of the conceptual framework in section 5.1. For instance, it is worth considering whether the newly introduced DPO (Direct Preference Optimization) represents a more efficient paradigm for training language models than RLHF. The idea is to train on preferences without relying on reinforcement learning. If RLHF remains the favored approach, the question arises whether alternatives such as ILQL (Implicit Language Q-Learning) [42] might outperform the existing PPO approach.
Regarding the perspective of the patent domain, it is noteworthy that various conventional tasks will remain highly relevant to the effective generation of patent text. Take, for instance, the longstanding hurdle of prior art search (semantic, keyword-based, or both), a challenge that remains unresolved. Training an effective reward model for assessing novelty or non-obviousness necessitates the incorporation of related prior art references into the training data. Lacking a more effective prior art search, the sufficiency of the training data's scope will be compromised. Another example is patent classification. In this context, the granularity of patent classification is not confined to existing classification systems like CPC (Cooperative Patent Classification). With increased granularity, a classifier can refine the focus within a specific technical field, enabling the training data to be more centered around that specific scope. Consequently, model training can center on a more nuanced and precise range of training data. This refined training data is anticipated to bring advantages to models at all stages within RLHF. These are two instances of conventional patent tasks that can support the core objectives in RLHF, and there likely exist additional such tasks to explore. In summary, considering both the perspectives of generative AI and the patent domain, there remains considerable uncharted territory holding significant potential for future exploration.

Conclusion
From a broad perspective, this research frames patent prosecution as a reinforcement learning system. The goal of applying reinforcement learning is to make use of human feedback or intent in patent prosecution and increase the chances of generating patent claims that are likely to be granted. Although the human feedback that is publicly accessible in text format is limited, the experiments conducted as part of this research demonstrate that generative language models can be controlled through reinforcement learning. These models are capable of reflecting the human feedback or intent in patent prosecution. Regarding human feedback, the language models can be trained to align with a reward model that distinguishes granted patents from pre-grant applications. Regarding human intent, these language models can be trained using different reward functions based on the length of patent claims, the number of limiting terms in patent claims, or a combination of both. While the generated patent text currently falls short of the quality required for allowance by patent offices and requires significant improvement, these experiments confirm the viability of applying RLHF to patent text generation. Notably, the standard 3-stage RLHF pipeline has been implemented in the patent domain for the first time. To foster the realization of the ideas presented in this research, the source code and datasets will be made available. This will prove valuable in the future as more human feedback becomes accessible for implementation within patent offices or for broader use in the public domain.
"1.A method for registering an object in three dimensions using machine vision comprising the steps of: at training time, acquiring training images of an object used for training with one or more cameras; at runtime, acquiring runtime images of an object to be registered at runtime with the one or more cameras; and determining a three dimensional pose transformation between the pose of the object used at training time and the pose of the object to be registered at runtime by (a) defining features in each of the runtime images as three− dimensional rays through an origin of each of the one or more camera's, respectively, (b) associating the three− dimensional rays with corresponding runtime features from the training images, and (c) computing an optimal pose estimate which maps the training features onto the corresponding three−dimensional rays of runtime features using iterative, reweighted least squares analysis."], "prompt": [ "<|start of claim|>1.A method for registering an object in three dimensions using machine vision comprising the steps of: at training time, acquiring training images of an object" ], "generated": [ " using machine vision, wherein machine vision has training directions relative to an imaging sensor; combining the training images using a face recognition method to obtain feature points; combining equipartitional information of feature points from different images as feature information; registering said feature information to a model standard using said feature points; and adapting said feature information and said model standard to another system in real−time.<|end of claim|>" ], "rewards": [ 1.9453125 ] }, A.2 max len=512, lower reward (1.04)

{
"doc id": [ "8768795" ], "appl id": [ "13612758" ], "flag patent": [ 1 ], "claim one": [ "1.A system, comprising: a processor configured to execute a utility analytics system, an unbilled utility system, and a revenue recoverability system, wherein: the utility analytics system comprises: a non−technical utility loss system configured to derive a probability of at least one utility non−technical loss; the unbilled utility system is configured to calculate an unbilled utility service for a customer based on customer attributes and the probability of the at least one utility non−technical loss; and the revenue recoverability system is configured to calculate an expected recoverable revenue based on probability of revenue recovery and of the expected unbilled utility service."], "prompt": [ "<|start of claim|>1.A system, comprising: a processor configured to execute a utility analytics system, an unbilled utility system, and a revenue recoverability" ], "generated": [ " system.<|end of claim|>" ], "rewards": The following results are extracted from the file labeled "experiment 1 2.txt" in [38].

{
"doc id": [ "8010356" ],
"appl id": [ "11356898" ],
"flag patent": [ 1 ],
"claim one": [ "1.A computer implemented method of training a model to characterize coarticulation and reduction in a speech signal using a computer with a processor, comprising: calculating, with the processor, a likelihood function for an observation vector derived from cepstral observation data corresponding to speech as an objective function; using a gradient ascent algorithm, using a frame independent approximation that provides a closed form solution, to estimate, with the processor and based on the calculated objective function, covariance matrices for the model, the covariance matrices being estimated independent of a time frame corresponding to the cepstral observation data, wherein the model gives a probability of a vocal tract resonance (VTR) trajectory as a filtered sequence of VTR target frequencies and a probability of the observation vector, given as a function of the VTR trajectory and an acoustic model having cepstral residuals as acoustic parameters, and wherein the covariance matrices are estimated such that at least one covariance matrix is used for generating the probability for the filtered sequence of vocal tract target frequencies and at least a second covariance matrix is used for the acoustic cepstral residuals used by the acoustic model; and outputting with the processor , the model for use in modeling coarticulation and reduction in the speech signal to recognize speech

"flag patent": [ 0 ],
"claim one": [ "1.A method for extracting a stream graph for a hardware circuit from a program written in a programming language , comprising: receiving a program; analyzing the program , by a processor, for one or more constructs in the programming language that builds the stream graph for programming the hardware circuit, a node in the stream graph representing a computation to be performed, the computation expressed in the programming language, an edge in the stream graph representing data flowing from the node to another node in the stream graph; and outputting one or more artifacts representing the stream graph, wherein the stream graph can be compiled for the hardware circuit to execute and produce a result with different node initialization values and different streaming data values that are presented in an execution of the program not known at compile time, and wherein the stream graph can be built using recursive computation and the connectivity of the stream graph can be arbitrary."],
"prompt": [ "<|start of claim|>1.A method for extracting a stream graph for a hardware circuit from a program written in a programming language, comprising: receiving a program; analyzing" ],
"generated": [ " the program for a stream graph; identifying a node in the stream graph; iteratively traversing the stream graph so as to connect the identified node to other nodes.<|end of claim|>" ],
"rewards": [ 1.1767578125 ]
},

" wherein predictors are predictors that are predictors already used by a search criteria, each predictor associated with a data unit having an associated target value, using a model for the search criteria as an expert model, wherein a plurality of training or comparison sets each comprising at least one data unit are used to score the potential match candidate using one of a plurality of non-stop algorithms, wherein one selected of the plurality of non-stop algorithms is selected based on a local measure for indicating the selected non-stop algorithm of the plurality of non-stop algorithms being accurate, the local measure being a function of the target value and the associated data unit; ranking the potential match candidates by function of the confidence of the potential match candidate, wherein the function comprises dividing the potential match candidate from a basis set having a plurality of data units having a score, wherein a plurality of parameters of each data unit in the plurality of data units exceeds an unacceptably high threshold parameter and wherein the unacceptably high threshold parameter comprises a value associated with confidence of the plurality of parameters, and wherein the score for a candidate domain of the potential match candidates is a representation of a significance of the data unit in the candidate domain under a search criteria parameter, wherein the plurality of selected algorithms include the one selected non-stop algorithm and a parameter subsystems algorithm, and wherein the plurality of risk parameters include the estimates for v(x0,x5,x0,x0), and wherein the candidate domain has x0 with a confidence v (x0,x1,x0,x values for x0 over x values for one or more x values distributed over \u221a{square root over (6) and where m is a number of data units having X values, and the score for each data unit increased by one at a boundary when the local measure for indicating all the data units above the maximum possible score is decreased .<|end of claim|>" ],
"rewards": [ 12.0 ]
},

The following results are extracted from the file labeled "experiment 2 2.txt" in [38].

A method for composition of subtitles for audio/video presentations, wherein data streams containing video, audio and subtitle information are transferred from a storage medium, such as a disc, comprising steps of: retrieving from said storage medium subtitle layer data, the subtitle layer data containing graphic subtitle elements; extracting from said retrieved subtitle layer data subtitle display region information for defining a subtitle display region, and a cropping flag, and subtitle display region cropping information for cropping the display region, the subtitle display region cropping information comprising a parameter for horizontal position of the cropped subtitle display region, a parameter for vertical position of the cropped subtitle display region, a parameter for width of the cropped subtitle display region, and a parameter for height of the cropped subtitle display
region; and enabling, depending on the value of the cropping flag, automatic cropping of rectangular parts of the subtitle elements to be displayed in the cropped subtitle display region; wherein the cropped parts of the subtitle elements to be displayed are defined by said subtitle display region cropping information " wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein where wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein wherein whereinenabling wherein wherein wherein wherein wherein wherein wherein wherein wherein the wherein wherein wherein wherein whereinATCTG wherein wherein wherein wherein wherein when wherein wherein said wherein wherein said wherein wherein whereinBro wherein wherein wherein wherein wherein wherein: wherein imnamed wherein the wherein identifying wherein the wherein wherein wherein co somewhere wherein wherein wherein pointed wherein wherein formal wherein wherein wherein wherein wherein brands wherein.wherein equ handled wherein wherein non satellite wherein and least wherein the wherein of the wherein wherein wherein sub Wel wherein g wherein<|end of claim|>" ], "rewards": [  "ppo generated": " the navigation system, and at least one control signal 
generated at a controller when the user has commanded the navigation system to do so, the controller having a processor, a non-persistent input/output data store, and a data input/output port, the at least one control signal comprising at least one logical signal, and the at least one logical signal along with the processor being responsive to the at least one control signal for generating a data input signal, the data input/output port being responsive to the at least one logical signal for receiving and storing a string of logical data for identification as the at least one logical signal, wherein the at least one logical signal essentially enables the processor to distinguish a prior control command as well as a subsequent control command after generating the data input signal, and wherein the prior control command is normally indicative that a stop mode is commanded, and wherein the data input signal is responsive to depressing a particular key on a keypad that is operated by the user while in the stop mode, and wherein the stop mode is a demonstration mode in which commands are input from a user such as via a touch screen and command the navigation system to ignore inputs from the user such as via the keypad, and wherein the non-persistent input/output data store comprises historical systems, at least one ranking, and at least one weight or parameter determinant.<|end of claim|>",
"ppo current reward": 0.8808419108390808,

D.3 doc id = 5583946, before PPO, favorable outcome

"prompt": [ "<|start of claim|>1.A method for recognizing a gesture input on a display screen for a computer system, the method comprising the steps of: (a) receiving" ],
"sft generated": " an input gesture sequence from a user, said input gesture sequence identifying the particular gesture of the user, (b) receiving an output gesture sequence from the computer system; determining whether the user is executing a speech recognition sequence on the output gesture sequence, wherein if the user is not executing a speech recognition sequence on the output gesture sequence, then displaying the fact that the user has accessed the display screen; determining, whether the user is executing a pointing device on the display screen initiated by movements of a stylus on the display screen; and initiating a computing-oriented application on the user for updating the set of language-specific words suitable for executing speech recognition on the input gesture sequence, when the input gesture sequence is consistent with the set of language-specific words.<|end of claim|>",
"sft current reward": 0.38146066665649414,

D.4 doc id = 5583946, after PPO, unfavorable outcome

"prompt": [ "<|start of claim|>1.A method for recognizing a gesture input on a display screen for a computer system, the method comprising the steps of: (a) receiving" ],
"ppo generated": " a candidate list from the computer system, which contains a plurality of candidate groups each having a barcode, each barcode being disposed such that it corresponds to lines of a two-dimensional screen of the display, each candidate group including, for a group of candidate groups, a group name representing the type of barcode that the group has in that group, and including, for each candidate group, a majority identifier for identifying groups to which the candidate group belongs, and further includes a sub-set identifier for identifying an individual one of the candidate groups; (b) determining whether to display a result for each candidate group; and (c) displaying, for a group-by-group result, the corresponding portion of the screen and presenting a pregained result, if the group-by-group result is frictionally spaced on the screen with respect to other results; whereby undershooting by mistake where the group is selected for presentation almost invariably occurs, eventually falsely producing sway of characters in speech.<|end of claim|>",
"ppo current reward": 0.5985496044158936,

2 and 111.4 respectively for
patent applications published between 2001 and 2014 that were either abandoned or granted later. This manuscript explores the intriguing research question of how to train the policy model in RLHF to control the length of text generation. In this experiment, the reward function for training is designed as shown in Listing 1:

Listing 1: Reward based on claim length

```python
# text: generated by PPO Trainer
# max_len: upper limit for having a positive reward
s = text[:max_len]
if s.find('<|end of claim|>') == -1:
    reward = 0
else:
    reward = 1 + len(s) / float(max_len)
```
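To make the behavior of the length-based reward in Listing 1 concrete, the following is a minimal sketch that wraps the same logic in a function. The function name `length_reward` and the constant `END_TOKEN` are hypothetical names introduced here for illustration only; the end marker is written as it appears in this document, and the sketch is not the exact training code.

```python
# Hypothetical wrapper around the logic of Listing 1 (names are assumptions).
END_TOKEN = "<|end of claim|>"  # end marker as written in this document


def length_reward(text: str, max_len: int) -> float:
    """Return 0 if no end marker appears within the first max_len characters,
    otherwise a reward between 1 and 2 that grows with the kept length."""
    s = text[:max_len]  # keep only the first max_len characters
    if s.find(END_TOKEN) == -1:
        # The claim did not terminate within the limit: no reward.
        return 0.0
    # A terminated claim earns 1 plus the fraction of the limit it uses.
    return 1 + len(s) / float(max_len)
```

Under these assumptions, a claim that terminates using half of the limit receives 1.5, a claim whose end marker falls beyond `max_len` characters receives 0, and the reward never exceeds 2. The reward therefore favors longer claims only as long as they still finish within the limit.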

Table 1: Datasets of Patent Claim Ones

Table 3: Performance of Reward Models (RM)

concept of the reward model by substituting the model with different reward functions.

Table 4: Granted Ratios Before & After PPO Training